
[Paper Club] Upcycling Large Language Models into Mixture of Experts



00:00:00.000 | >> My internet works too.
00:00:04.440 | >> Okay, cool. Today, I'm going to present on Mixture of Experts.
00:00:10.960 | I'm Ethan from NVIDIA.
00:00:13.160 | I work on scaling LLMs and transformers.
00:00:18.340 | I might sound a bit muffled because I have a cold.
00:00:22.560 | Apologies for that.
00:00:25.000 | For this topic, I'm going to give a brief introduction on Mixture of Experts, or MoE, first.
00:00:33.740 | Then I'm going to talk about Megatron Core MoE,
00:00:37.280 | and how we accelerate these MoEs to train and do inference efficiently.
00:00:44.400 | Finally, I'm going to talk about upcycling LLMs into MoE,
00:00:49.640 | which is our recent paper.
00:00:54.440 | So the AI models are growing larger and larger.
00:00:59.600 | This is a rather old picture from 2021.
00:01:03.720 | Switch Transformer was the first model that surpassed one trillion parameters.
00:01:09.900 | Before that, it was only hundreds of billions of parameters.
00:01:14.520 | I think at that time,
00:01:16.840 | model size was growing 10 times each year.
00:01:19.840 | It looks like it's slowing down recently.
00:01:23.760 | The question is, we only have so much compute,
00:01:30.480 | how can we make the model better without increasing the compute?
00:01:35.380 | As Noam Shazeer said,
00:01:38.400 | "My unsubstantiated theory is that parameters are good for knowledge,
00:01:44.640 | and compute, or FLOPs, is good for intelligence,"
00:01:49.440 | whatever those terms mean.
00:01:51.960 | So MOE is a good way of growing
00:01:55.840 | the parameters or growing knowledge without increasing the compute.
00:02:01.680 | So what is MOE?
00:02:05.640 | Here is a very simple diagram from Switch Transformer.
00:02:11.520 | For traditional LLM transformers,
00:02:16.040 | you have the self-attention,
00:02:18.280 | the residual and layer norm,
00:02:21.800 | and then you have an FFN layer,
00:02:25.280 | which is just a two-layer MLP,
00:02:27.200 | and then the residual.
00:02:29.800 | MoE transforms the FFN layer into multiple copies of it.
00:02:36.680 | You see here, there are four FFN layers,
00:02:41.400 | and then each token selectively activates a few experts,
00:02:46.040 | which are selected by a router.
00:02:48.680 | The router is simply a matrix multiplication,
00:02:52.160 | a learnable matrix that selects one of the experts based on the input.
00:02:59.360 | The model size increases, enhancing its capability,
00:03:03.400 | while the compute roughly remains the same as the original model.
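(As a rough illustration of the router described above — a single learnable matrix producing per-expert probabilities — a top-1 router in PyTorch might look like the following; the names and shapes are made up for this example and this is not the Megatron implementation.)

```python
import torch
import torch.nn as nn

hidden_size, num_experts = 16, 4
router = nn.Linear(hidden_size, num_experts, bias=False)  # the learnable routing matrix

tokens = torch.randn(6, hidden_size)             # six token hidden states
probs = torch.softmax(router(tokens), dim=-1)    # per-token probability over experts
expert_idx = probs.argmax(dim=-1)                # top-1: the expert selected for each token
```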
00:03:08.880 | If we look into the MOE layer,
00:03:14.320 | it actually consists of several steps.
00:03:17.920 | It's more complicated than the original FFN layer.
00:03:23.560 | You can think of the original FFN layer as the third step, the computation,
00:03:30.080 | where the expert layers,
00:03:33.840 | which were the original FFN layer,
00:03:36.320 | are applied to the input tokens.
00:03:40.320 | The first step here is routing.
00:03:44.920 | Given the input tokens, for example,
00:03:47.680 | here you have six tokens,
00:03:50.840 | "the quick brown fox jumped over".
00:03:55.240 | They go through the router.
00:03:58.120 | The router is simply a matrix that is applied to these tokens.
00:04:02.720 | It generates the probabilities,
00:04:05.280 | and we take the highest probability as the router's selection.
00:04:13.680 | The expert indices here
00:04:16.400 | are the experts which have the highest probability for these tokens.
00:04:23.240 | The second step is permutation.
00:04:26.560 | Given the experts each token selected,
00:04:31.560 | you need to align these input features with the experts.
00:04:37.200 | All of the tokens that select expert zero,
00:04:42.120 | you need to arrange them into a single matrix
00:04:47.600 | that is only for expert zero,
00:04:51.800 | and likewise for expert one and expert two.
00:04:54.840 | Depending on the capacity factor,
00:04:57.760 | some of the tokens might be dropped.
00:05:01.000 | The third step is the same as the original FFN layer,
00:05:04.720 | where you do the computation.
00:05:07.640 | After the computation,
00:05:09.440 | there will be an un-permute step
00:05:14.080 | where you need to arrange these tokens back to the original shape.
00:05:19.040 | Then the router probability is applied as a scaling factor
00:05:25.000 | on all of these expert features.
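(To make the four steps concrete, here is a minimal, unoptimized sketch of top-1 dispatch in plain PyTorch, without token dropping, capacity limits, or any of the parallelism discussed later; tensor names and sizes are illustrative only.)

```python
import torch
import torch.nn as nn

num_tokens, hidden, num_experts = 6, 16, 4
x = torch.randn(num_tokens, hidden)
router = nn.Linear(hidden, num_experts, bias=False)
experts = nn.ModuleList([nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                       nn.Linear(4 * hidden, hidden))
                         for _ in range(num_experts)])

# 1. Routing: probabilities per expert, take the top-1 expert per token.
probs = torch.softmax(router(x), dim=-1)
top_prob, top_idx = probs.max(dim=-1)

# 2. Permutation: sort tokens so tokens going to the same expert are contiguous.
order = torch.argsort(top_idx)
x_perm = x[order]

# 3. Expert computation: run each expert on its contiguous slice of tokens.
y_perm = torch.empty_like(x_perm)
counts = torch.bincount(top_idx, minlength=num_experts)
start = 0
for e, count in enumerate(counts.tolist()):
    if count:
        y_perm[start:start + count] = experts[e](x_perm[start:start + count])
    start += count

# 4. Un-permute back to the original token order, then scale by the router probability.
y = torch.empty_like(y_perm)
y[order] = y_perm
y = y * top_prob.unsqueeze(-1)
```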
00:05:32.520 | Scaling this MOE during training is very challenging.
00:05:42.000 | First, the models are at a massive scale.
00:05:46.080 | Usually, for example, with Mixtral 8x7B,
00:05:49.400 | you increase the model parameters roughly by 6x or 7x.
00:05:55.880 | This puts substantial pressure on the memory usage.
00:06:01.040 | And the router dispatching also has overhead.
00:06:04.920 | In the previous slide,
00:06:07.560 | you will notice there's a permute and an un-permute operator.
00:06:11.680 | That essentially increases the activation memory by two times
00:06:18.800 | because all of those need to be stored.
00:06:21.200 | If you have top-k routing,
00:06:23.840 | it would also increase the memory of the activation by k
00:06:28.200 | because the hidden states need to go to each expert
00:06:33.400 | and you need to duplicate it.
00:06:37.360 | It also reduces the GEMM efficiency,
00:06:39.800 | because you need to do a loop over all the experts
00:06:42.760 | to do the GEMMs separately.
00:06:46.480 | There's also an imbalance issue: if all of the tokens
00:06:50.440 | go to one expert,
00:06:51.840 | the other experts on other GPUs would be idle.
00:06:58.160 | Now, let me introduce the Megatron Core MOE,
00:07:04.760 | which is how we accelerate these MOE models
00:07:08.480 | given these challenges.
00:07:13.240 | So Megatron-LM and Megatron Core
00:07:16.520 | are an open-source library available on GitHub.
00:07:21.360 | We accelerate not only MoE but also all of the LLMs,
00:07:28.240 | including GPT, BERT, T5.
00:07:31.880 | I'm not sure if anyone is still using those now,
00:07:35.080 | but primarily GPT models.
00:07:37.360 | And inside the transformer layers,
00:07:41.440 | the attention is accelerated with all kinds of parallelism,
00:07:46.680 | including pipeline parallelism, tensor parallelism,
00:07:50.760 | and other parallelism strategies, and the MoEs are also accelerated.
00:07:55.920 | This is what we are primarily talking about today,
00:07:58.560 | Megatron Core MOE.
00:08:01.800 | On top, you have two customizable training loops.
00:08:06.640 | Megatron-LM provides a simple bare-bones training loop
00:08:10.520 | that you can easily hack.
00:08:13.200 | And NeMo provides a high-level interface
00:08:17.080 | where you can just provide a Pythonic configuration
00:08:20.520 | to train these models.
00:08:21.560 | In the Megatron Core MOE, we provide different approaches
00:08:29.960 | to accelerate these models.
00:08:32.240 | For the router, there are the aux loss and Sinkhorn.
00:08:36.760 | Basically, the aux loss is used for token choice MoE.
00:08:42.160 | And Sinkhorn, without token dropping,
00:08:45.600 | can usually be used in expert choice.
00:08:50.120 | And for the token dispatcher, there
00:08:53.760 | are permute and unpermute for efficient memory saving.
00:08:57.760 | And for the experts, we have a grouped MLP to accelerate this.
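(For reference, the aux loss mentioned here usually refers to a load-balancing loss in the style of Switch Transformer; a sketch of that formulation — not Megatron's exact code — is below.)

```python
import torch

def load_balancing_aux_loss(router_probs, expert_idx, num_experts):
    """Switch-Transformer-style aux loss: num_experts * sum_e f_e * P_e, where f_e is
    the fraction of tokens routed to expert e (top-1) and P_e is the mean router
    probability assigned to expert e. Minimizing it pushes the routing toward balance."""
    f = torch.bincount(expert_idx, minlength=num_experts).float() / expert_idx.numel()
    p = router_probs.mean(dim=0)
    return num_experts * torch.sum(f * p)
```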
00:09:05.720 | So first, it's expert model parallel.
00:09:10.360 | This is available in Megatron Core MOE now.
00:09:18.240 | Usually, you would put all of the experts
00:09:22.720 | on one single GPU.
00:09:24.480 | And then you do a for loop over all of the experts
00:09:27.840 | to compute the result. But instead, we
00:09:30.920 | can put one expert on each GPU.
00:09:35.360 | This will release a lot of memory
00:09:37.840 | and also accelerate the training and the inference.
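(A toy sketch of the idea — one expert per GPU, with an all-to-all exchange so each rank computes only its own expert — is below. It assumes, for simplicity, that tokens are already permuted and that every expert receives the same number of tokens; the real Megatron expert parallelism handles variable counts and integrates with the other parallelism strategies. Run under torchrun with one process per GPU.)

```python
import torch
import torch.distributed as dist
import torch.nn as nn

# torchrun --nproc_per_node=<num_gpus> expert_parallel_sketch.py
dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device("cuda", rank)

hidden, tokens_per_expert = 16, 8
expert = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                       nn.Linear(4 * hidden, hidden)).to(device)   # this rank's expert

# Local tokens, already permuted so chunk i should go to the expert living on rank i.
local_tokens = torch.randn(world * tokens_per_expert, hidden, device=device)

# All-to-all: each rank receives exactly the tokens that selected its expert.
recv = torch.empty_like(local_tokens)
dist.all_to_all_single(recv, local_tokens)

out = expert(recv)                     # expert computation for this rank only

# Reverse all-to-all sends the results back to the ranks that own the original tokens.
back = torch.empty_like(out)
dist.all_to_all_single(back, out)
```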
00:09:42.080 | So token dropping, the default we use
00:09:49.880 | is dropless, meaning all of the tokens
00:09:54.160 | can go to one expert and no tokens are dropped.
00:10:00.280 | We also support token dropping with padding,
00:10:05.960 | meaning, given a set capacity factor, for example, four here.
00:10:11.560 | Each expert can, at max, accept four tokens.
00:10:15.960 | Tokens beyond that are going to be dropped.
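(A rough sketch of that dropping rule is below; it treats the capacity as a raw token count, as in the example here, while real implementations usually derive it as capacity_factor * tokens / num_experts and also pad the kept tokens.)

```python
import torch

def keep_mask_by_capacity(expert_idx, num_experts, capacity):
    """Keep at most `capacity` tokens per expert, in token order; the rest are dropped."""
    keep = torch.zeros_like(expert_idx, dtype=torch.bool)
    for e in range(num_experts):
        positions = (expert_idx == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True
    return keep

expert_idx = torch.tensor([0, 0, 0, 0, 0, 1, 2, 0])   # top-1 assignments for 8 tokens
mask = keep_mask_by_capacity(expert_idx, num_experts=4, capacity=4)
# The 5th and 8th tokens (both routed to expert 0) exceed the capacity and are dropped.
```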
00:10:20.040 | So accuracy-wise and efficiency-wise,
00:10:24.680 | there are a lot of discussion around here.
00:10:29.400 | A lot of the pre-training experiments
00:10:32.880 | show that token dropping is very efficient
00:10:35.760 | and it doesn't impact performance.
00:10:38.480 | But in some of the downstream fine-tuning,
00:10:42.000 | people realize dropless is better.
00:10:44.000 | Maybe it's because of the domain shifts.
00:10:48.920 | The tokens are no longer balanced.
00:10:51.520 | So we provide both of the options.
00:10:57.160 | So recently, there are a lot of new MOEs
00:11:02.120 | that have an increasing number of experts.
00:11:05.120 | This can cause a very large overhead.
00:11:09.280 | For example, the DeepSeek-V2 MoE has 160 experts
00:11:14.520 | and eight of them are active.
00:11:17.720 | If you think about the memory overhead,
00:11:22.760 | let's say first you would have the tokens.
00:11:27.160 | And these tokens need to go to different experts.
00:11:30.560 | And at this step, you would have a scatter and a copy operator.
00:11:35.520 | Depending on the number of the top k,
00:11:38.200 | say the top k is eight here, each of the hidden states
00:11:42.800 | would be copied eight times, which is a huge overhead.
00:11:47.920 | And after the expert operation is done,
00:11:53.400 | there are another eight copies of it here.
00:11:56.280 | So we have a fused permutation operation
00:12:03.880 | available in Megatron Core MoE, where all of these operations
00:12:09.000 | are fused.
00:12:10.320 | The copy operation here has zero compute,
00:12:17.640 | but it causes duplicated memory.
00:12:21.240 | If you have these fused operations,
00:12:24.880 | you can cheaply recompute these features during the backward pass
00:12:30.520 | while saving a lot of memory.
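(To see where the duplicated memory comes from, here is the unfused dispatch path written out, with the top_k-times-larger intermediates marked; as described above, the fused permute/unpermute ops avoid storing these for backward and recompute the zero-FLOP copies instead. This is only an annotated baseline, not the fused kernel itself.)

```python
import torch

def naive_dispatch(hidden, row_idx, gate_probs, expert_fn):
    """hidden: [tokens, h]; row_idx, gate_probs: [tokens * top_k] flattened routing info."""
    expanded = hidden.index_select(0, row_idx)        # copy: zero FLOPs, top_k x activation memory
    expert_out = expert_fn(expanded)                  # the actual expert GEMMs
    scaled = expert_out * gate_probs.unsqueeze(-1)    # another top_k x sized intermediate
    merged = torch.zeros_like(hidden)
    merged.index_add_(0, row_idx, scaled)             # un-permute / combine back per token
    return merged
```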
00:12:32.040 | Let's also look at the implementation of Mixtral 8x7B
00:12:42.600 | in Hugging Face Transformers.
00:12:44.880 | You'll notice in the expert operation there,
00:12:50.560 | you iterate over all of the experts
00:12:53.240 | and compute each of the GEMM operations one by one.
00:12:59.400 | We found that this is very inefficient.
00:13:03.240 | Instead, we provide an interface to CUTLASS grouped GEMM,
00:13:10.880 | where the CUTLASS grouped GEMM turns all of the looping
00:13:16.480 | over experts and the GEMM calculations into a single operation.
00:13:23.520 | Given any number of experts and any number of tokens,
00:13:29.920 | you can efficiently compute the output in one operation.
00:13:34.920 | And this is very efficient.
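(Conceptually, the per-expert loop and the grouped GEMM compute the same thing. The sketch below uses torch.bmm as a stand-in for the single fused call in the special case where every expert gets the same number of tokens; the actual CUTLASS grouped GEMM handles variable group sizes in one kernel launch.)

```python
import torch

num_experts, tokens_per_expert, hidden, ffn = 4, 8, 16, 64
x = torch.randn(num_experts, tokens_per_expert, hidden)   # tokens already grouped by expert
w = torch.randn(num_experts, hidden, ffn)                  # one weight matrix per expert

# Loop over experts: one small GEMM per expert (what the per-expert implementation does).
out_loop = torch.stack([x[e] @ w[e] for e in range(num_experts)])

# One batched call standing in for the grouped GEMM: a single kernel for all experts.
out_grouped = torch.bmm(x, w)

assert torch.allclose(out_loop, out_grouped, atol=1e-5)
```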
00:13:36.960 | So that's pretty much all of the optimizations for MOE
00:13:51.000 | in Megatron Core MoE.
00:13:52.640 | If any one of you have questions,
00:13:55.560 | I can first answer those questions
00:13:58.240 | and then go to MOE upcycling.
00:14:00.680 | I'm just curious, are these available for everyone
00:14:13.480 | to use in other frameworks?
00:14:16.640 | Or are these specific to the Megatron implementation?
00:14:22.880 | So you can import them as a standalone module
00:14:33.200 | and apply them to any of your--
00:14:37.680 | Core library, right?
00:14:40.200 | Yeah, you can import Megatron Core as a library
00:14:44.040 | and just use it in your network.
00:14:46.160 | But I think there are some caveats.
00:14:49.160 | So for example, if you want to use expert parallelism,
00:14:53.720 | you will need to also use Megatron's parallelism strategy
00:15:01.760 | and initialize that first.
00:15:04.800 | But if you're using, for example, the grouped GEMM,
00:15:08.360 | I think you can just get away with any kind of network.
00:15:12.720 | You can combine it with Hugging Face Transformers
00:15:15.800 | without any problem.
00:15:18.080 | This is just a PyTorch layer with a fused operator.
00:15:23.880 | You can just import it as a library.
00:15:26.320 | It's very standalone.
00:15:27.720 | Cool.
00:15:31.720 | Do you have any intuition as far as the knowledge
00:15:46.480 | contained in each expert?
00:15:49.720 | I guess previous to seeing this paper,
00:15:52.000 | I had assumed that maybe one expert was good at economics,
00:15:56.680 | another one was good at physics, and so forth.
00:15:59.120 | But this makes it seem like it's more token by token.
00:16:05.920 | So do you have any intuition around that?
00:16:09.240 | Yeah, so we haven't studied this in our research,
00:16:12.680 | but I saw a lot of research on interpretability
00:16:16.120 | of the MOE models.
00:16:19.280 | So unfortunately, people didn't find
00:16:24.320 | significant interpretability inside these experts,
00:16:31.120 | like one expert focusing on math and another focusing on literature.
00:16:36.320 | I think the problem is that neural network hidden states
00:16:41.560 | are already very entangled.
00:16:44.560 | So one hidden state is in this kind of superposition
00:16:49.440 | where it can represent multiple different features.
00:16:54.960 | So it's very hard to tell which expert focuses on which area.
00:17:00.080 | There is some evidence of specialization
00:17:04.720 | in the early layers of the experts, for example,
00:17:08.440 | the first and second layers.
00:17:10.160 | You'll find some experts focus on multiple tokens,
00:17:14.520 | some experts focus on single tokens, things like that.
00:17:20.760 | And there's also one pretty interesting piece of research
00:17:23.520 | from Facebook.
00:17:26.440 | They do specialized training of dense models first,
00:17:32.040 | then combine those dense experts, each specialized
00:17:36.760 | in a different domain, into an MoE.
00:17:40.520 | In that case, it still preserves some of the specializations.
00:17:46.600 | Thank you.
00:17:47.120 | Just a quick question.
00:17:52.800 | We talked about top-k routing to select the experts
00:17:56.400 | throughout.
00:17:57.400 | Is there any benefit to using maybe a top-p sampling
00:18:00.200 | approach, similar to how you would use top-p sampling elsewhere,
00:18:03.960 | for selecting an expert, as compared to top-k?
00:18:09.000 | Top P?
00:18:10.360 | What do you mean top P?
00:18:12.520 | Considering a list of experts until they
00:18:16.080 | exceed a given probability cumulatively?
00:18:20.720 | Does that make sense?
00:18:22.120 | I see, yeah.
00:18:22.920 | So since the top--
00:18:27.040 | it will be dynamic, let's say.
00:18:30.160 | Sometimes it can select some more.
00:18:33.360 | Sometimes it selects less.
00:18:35.400 | I think that promotes a little more diversity, I've heard.
00:18:39.520 | Yeah.
00:18:41.400 | Yeah, I think that makes sense.
00:18:44.080 | That will create some difficulty in optimization.
00:18:49.320 | But I think another thing that's pretty exciting is expert choice.
00:18:54.520 | You see, in an expert choice model, the selection is reversed.
00:19:00.480 | Everything we have talked about so far
00:19:03.000 | is usually token choice, which means
00:19:08.560 | the token selects k experts.
00:19:11.680 | So each token always has, for example,
00:19:14.400 | two experts applied to it.
00:19:17.320 | But expert choice is another way where the experts select tokens.
00:19:23.800 | Each expert always selects exactly k tokens.
00:19:28.720 | So even though this is fixed, from the token
00:19:32.960 | perspective, each token can have zero experts applied to it,
00:19:39.000 | or more than zero, or all of the experts applied to it.
00:19:42.720 | Yeah.
00:19:43.440 | In that case, you're either overloading the expert
00:19:45.520 | or not considering all the tokens.
00:19:47.280 | It's like a trade-off.
00:19:49.840 | Yeah, in fact, expert choice applies pretty well
00:19:53.600 | to vision models, because vision models do not have a causal mask.
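(A small sketch of the two routing directions, with made-up shapes, is below.)

```python
import torch

num_tokens, num_experts = 16, 4
capacity = 4                                    # tokens each expert picks in expert choice
logits = torch.randn(num_tokens, num_experts)   # router scores

# Token choice: every token picks its top-k experts
# (fixed k per token, but a variable load per expert).
token_choice = logits.topk(k=2, dim=-1).indices          # [num_tokens, 2]

# Expert choice: every expert picks its top-`capacity` tokens
# (fixed load per expert, but a token may be picked by zero, some, or all experts).
expert_choice = logits.topk(k=capacity, dim=0).indices   # [capacity, num_experts]
```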
00:20:01.080 | Thank you so much.
00:20:09.040 | I guess I can go to the next section, upcycling MOEs.
00:20:15.400 | So if you are going to remember one thing, remember this.
00:20:26.240 | So you can upcycle your dense models
00:20:29.160 | into a mixture of experts.
00:20:31.920 | By training these upcycled models,
00:20:34.200 | you can achieve better accuracy than simply training
00:20:37.560 | the dense model further for the same number of flops.
00:20:42.240 | The context is we have so many big dense models.
00:20:46.280 | For example, there's Llama 405B.
00:20:49.840 | And from NVIDIA, we have Nemotron 340B.
00:20:53.680 | These models are huge.
00:20:56.280 | It's very expensive to retrain an MoE variant of them.
00:21:00.840 | If we want to further improve these models,
00:21:05.320 | you can upcycle these models into MOE
00:21:09.680 | to achieve better accuracy.
00:21:12.240 | In our scaling experiments, we tried upcycling a 15B model
00:21:20.040 | on 1 trillion tokens and achieved roughly a 5%
00:21:25.800 | improvement in terms of the validation loss
00:21:29.080 | and 4% improvement on MMLU.
00:21:33.320 | It's exciting because the original sparse upcycling
00:21:38.280 | paper found it difficult to scale
00:21:42.000 | beyond 1 billion parameters.
00:21:45.480 | We found there are several key factors to go beyond 1 billion.
00:21:55.400 | So this is the concept of MoE.
00:21:59.840 | Let's say you have the MLP in the original plain vanilla
00:22:05.320 | transformer.
00:22:07.040 | And here, this is a mixture of two experts.
00:22:11.360 | One of the experts is activated, so the FLOPs
00:22:15.160 | are the same as the original model.
00:22:17.400 | But the parameters are increased.
00:22:21.400 | To upcycle such a model, you do two things.
00:22:26.640 | First is to copy the MLP layer into a number of expert copies.
00:22:35.920 | Here is just two copies.
00:22:38.200 | And then you randomly initialize the router weights.
00:22:41.800 | Then you just train this model.
00:22:43.640 | Pretty straightforward, right?
00:22:47.320 | There is one caveat here.
00:22:50.960 | This model needs to perform the same as the original model
00:22:56.000 | in the first forward pass.
00:22:58.440 | Otherwise, it will lead to catastrophic forgetting.
00:23:01.800 | So the trick to maintain this feature
00:23:12.240 | is through swapping the order of the top-k and softmax operators.
00:23:17.680 | This was introduced in Mixtral 8x7B.
00:23:21.840 | Let's say you have the MLP copied,
00:23:24.960 | and then you do the top-k first to select two experts.
00:23:35.000 | And then you do the softmax on top
00:23:38.200 | of the logits from the top-k router.
00:23:43.640 | In this way, because the softmax is
00:23:46.160 | applied to the top-k output, it always sums up to 1.
00:23:51.960 | Here I give example 0.7, 0.3.
00:23:55.560 | And then you add these two outputs together.
00:24:01.240 | In this way, because the MLP layer
00:24:03.840 | is exactly the same as the dense model,
00:24:06.520 | these two copies are just the same.
00:24:09.160 | So the model output is the same as the original dense model.
00:24:13.800 | This is a very important feature in upcycling,
00:24:18.280 | because the upcycled MoE model actually
00:24:22.760 | behaves exactly the same as the dense model without any training.
00:24:27.400 | And this will help stabilize the model
00:24:30.160 | and avoid catastrophic forgetting.
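(A minimal sketch of this initialization and of the top-k-then-softmax order is below; it only checks the property described here — identical expert copies plus gates that sum to 1 give exactly the dense output — and is not the paper's training code.)

```python
import copy
import torch
import torch.nn as nn

hidden, num_experts, top_k = 16, 8, 2
dense_ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden))

# Upcycling: every expert starts as an exact copy of the dense FFN;
# only the router is randomly initialized.
experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
router = nn.Linear(hidden, num_experts, bias=False)

x = torch.randn(5, hidden)

with torch.no_grad():
    logits = router(x)

    # Mixtral-style order: top-k first, then softmax over the k selected logits,
    # so the gate weights always sum to 1 (e.g. 0.7 and 0.3).
    top_logits, top_idx = logits.topk(top_k, dim=-1)
    gates = torch.softmax(top_logits, dim=-1)

    moe_out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            moe_out[t] += gates[t, slot] * experts[int(top_idx[t, slot])](x[t])

    # Identical experts + gates summing to 1  =>  same output as the dense model.
    assert torch.allclose(moe_out, dense_ffn(x), atol=1e-5)
```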
00:24:34.120 | But the problem here is we actually
00:24:38.120 | found the Mixtral approach didn't work as well as
00:24:41.120 | expected, because the original Switch Transformer from Google
00:24:48.440 | uses softmax then top-k for a reason.
00:24:52.040 | And for upcycling, if you switch to top-k then softmax,
00:24:57.080 | it actually hurts performance somewhat.
00:25:01.480 | The difference is simply the swap of top-k
00:25:06.560 | and the softmax order.
00:25:09.960 | I've already explained how Mixtral did top-k then
00:25:14.800 | softmax.
00:25:16.160 | So in the original Switch Transformer paper,
00:25:19.240 | you apply softmax first.
00:25:21.680 | So the probability of all of the experts would sum up to 1.
00:25:28.360 | So if you apply top-k, the output no longer sums up to 1.
00:25:34.120 | Here, in the example, it's 0.4 and 0.2,
00:25:38.080 | so this is smaller than the original output.
00:25:40.920 | If you just train this model naively on a large scale,
00:25:46.720 | the model would catastrophically forget.
00:25:49.800 | But on a smaller scale model like 1B or under,
00:25:53.680 | you can kind of get away with this problem,
00:25:56.040 | because small models adapt very fast.
00:25:58.640 | This is probably one of the reasons
00:26:00.560 | the original sparse upcycling paper didn't go
00:26:03.960 | beyond 1 billion parameters.
00:26:06.720 | So we found a very simple approach to solve this problem.
00:26:16.360 | Because the output scale is smaller
00:26:18.520 | than the original model, we can simply
00:26:22.640 | scale up the MLP output by the number of experts
00:26:28.880 | divided by top-k.
00:26:30.680 | For example, if it's Mixtral 8x7B,
00:26:34.400 | we simply scale the MLP layer output by 4x.
00:26:39.360 | This will solve the problem of the scale.
00:26:42.480 | And the model still behaves the same as the original model.
00:26:45.720 | You can train the upcycled model normally.
00:26:51.120 | And then we found with this approach,
00:26:54.360 | it consistently outperforms the Mixtral approach.
00:27:00.280 | So we can get the benefit of the original Switch Transformer.
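(A sketch of the Switch-Transformer order plus this scaling fix is below. Note that with a freshly initialized router the softmax is only approximately uniform, so the scaled output matches the dense model's scale approximately rather than exactly; the sketch illustrates the mechanics, not the paper's exact code.)

```python
import copy
import torch
import torch.nn as nn

hidden, num_experts, top_k = 16, 8, 2
dense_ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden))
experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
router = nn.Linear(hidden, num_experts, bias=False)
x = torch.randn(5, hidden)

with torch.no_grad():
    # Switch Transformer order: softmax over all experts first, then top-k,
    # so the selected gates no longer sum to 1 (e.g. 0.4 + 0.2).
    probs = torch.softmax(router(x), dim=-1)
    top_probs, top_idx = probs.topk(top_k, dim=-1)

    scale = num_experts / top_k        # e.g. 8 / 2 = 4 for a Mixtral-8x7B-like config
    moe_out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            moe_out[t] += scale * top_probs[t, slot] * experts[int(top_idx[t, slot])](x[t])

# At initialization the router is near uniform, so the selected gates sum to roughly
# top_k / num_experts; the scale factor restores the MoE output to roughly the dense
# FFN's magnitude instead of shrinking it.
```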
00:27:07.000 | A bit of intuition behind why softmax then top-k is better:
00:27:13.680 | Because if you apply softmax to all of the experts,
00:27:19.200 | the probability distribution is always
00:27:23.600 | measured on all of the experts.
00:27:26.080 | However, in the swap case, the probability distribution
00:27:31.880 | is only on two experts.
00:27:33.840 | And it's dynamic.
00:27:35.120 | It's harder for the model to learn.
00:27:37.880 | Additionally, if you only use top-1,
00:27:41.800 | the top-k-then-softmax method will not work,
00:27:43.640 | because the softmax over a single expert is always 1,
00:27:49.240 | so there wouldn't be any gradient for the router to learn.
00:27:52.520 | So next, let's go to fine-grained MoE.
00:28:05.720 | This is very popular in the most recent MoEs.
00:28:09.720 | For example, Qwen2 uses 64 experts.
00:28:15.400 | And DeepSeek-V2 uses 160 experts.
00:28:23.360 | Granularity means using more experts, but smaller ones.
00:28:27.000 | So this gives the flexibility of more combinations
00:28:32.000 | of different experts, so more representation power.
00:28:37.640 | For example, here, originally you
00:28:40.840 | have, say, 2 big experts, and 1 is selected.
00:28:44.800 | Instead, you can expand the number of experts into 4.
00:28:52.120 | Each expert is smaller than before.
00:28:56.200 | So the compute is the same.
00:28:58.880 | And the parameters are also the same.
00:29:02.440 | Here is some notation.
00:29:05.480 | E2 means expansion factor is 2.
00:29:09.680 | This is how many times the expert is copied.
00:29:15.320 | And G2 is granularity, or how many times
00:29:21.840 | the expert is segmented.
00:29:24.720 | So you can think of this like the expert is first
00:29:29.880 | copied into two copies here.
00:29:32.720 | And then each copy is segmented two times.
00:29:37.400 | T2 means how many experts we route to.
00:29:45.320 | Here, you route to two experts.
00:29:48.120 | So the FLOPs are the same as the original one.
00:29:55.680 | So you would soon notice a problem
00:29:59.680 | if you upcycle such a model.
00:30:02.600 | Let's say these two segments are not the same anymore,
00:30:10.000 | because you segment one expert into two shards.
00:30:18.440 | For the top-2 case, if you select one shard two times
00:30:22.960 | and the other shard zero times, the output
00:30:26.160 | is no longer the same as the original model.
00:30:30.000 | And we mentioned that it's very important for upcycling
00:30:34.800 | to maintain the same forward pass
00:30:38.800 | as the original dense model.
00:30:42.280 | The solution here is also rather straightforward.
00:30:46.760 | So instead of randomly initializing the whole router,
00:30:50.520 | we would initialize half of the router weights, and then duplicate them.
00:30:56.200 | So this would ensure that the probability distribution would
00:31:01.080 | be the same in each virtual group, or in each shard group.
00:31:07.760 | Because if these are the same, the top-k selection
00:31:11.040 | will be exactly the same, the highest score
00:31:17.240 | would be the same for these two groups.
00:31:20.640 | The example here is that I shard the MLP into two parts.
00:31:29.280 | For the first forward pass, I need to select the orange segments
00:31:34.040 | and the blue segments together.
00:31:36.800 | By duplicating the router weights, this achieves the purpose.
00:31:44.880 | And the formula for scaling the weights is also the same,
00:31:51.480 | except there's an additional granularity factor.
00:31:55.480 | This is a bit complicated, so I'll skip this part.
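(A tiny sketch of the duplicated router initialization is below. It assumes one particular layout — the segments of the same expert copy laid out contiguously — which is my own choice for the illustration; the output scaling with the extra granularity factor is skipped here too, as in the talk.)

```python
import torch

hidden = 16
expansion, granularity = 2, 2              # E2 G2: 2 copies, each split into 2 segments
num_fine_experts = expansion * granularity

# Initialize router weights only for the `expansion` virtual experts...
base_rows = torch.randn(expansion, hidden) * 0.02
# ...then duplicate each row `granularity` times, so every segment of the same
# virtual expert gets an identical routing logit.
router_weight = base_rows.repeat_interleave(granularity, dim=0)   # [4, hidden]

x = torch.randn(hidden)
logits = router_weight @ x        # segments of one copy share the same score

# Top-2 therefore picks both halves of whichever copy scores highest, so the shards
# of one full MLP are always recombined in the first forward pass.
top_idx = logits.topk(2).indices
```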
00:32:00.480 | So for our experiments, we do them
00:32:10.280 | on the 8 trillion tokens that Nemotron was trained on.
00:32:14.120 | The ablation study is on a smaller model, Nemotron 2B.
00:32:18.840 | And for the bigger model, we do it on an 8x15B on 1 trillion
00:32:23.360 | tokens.
00:32:23.860 | And we found that the learning rate
00:32:29.440 | is the most important hyperparameter, as always
00:32:32.480 | in machine learning.
00:32:34.360 | And this is true also for upcycling.
00:32:37.120 | Learning rate is the most important parameter.
00:32:41.560 | So in the original sparse upcycling paper,
00:32:46.440 | the learning rate is taken from the minimum--
00:32:51.440 | the ending learning rate from pre-training.
00:32:54.080 | So sometimes it can be rather small.
00:32:57.760 | So we found that if you have a high learning rate,
00:33:04.120 | it could help the upcycling a lot.
00:33:07.720 | Here, the orange line is the lowest learning rate.
00:33:11.600 | Basically, you just continue to fine-tune the model into an MoE.
00:33:18.440 | This learning rate is typical, for example,
00:33:21.520 | for other tasks like alignment.
00:33:24.360 | But for upcycling, the model needs
00:33:28.240 | to adapt to a new local minimum.
00:33:31.520 | We need a larger learning rate for this.
00:33:34.400 | We found that the best is to use the original peak
00:33:39.800 | learning rate from pre-training.
00:33:43.120 | And if you dive into the weights of this upcycled model,
00:33:53.600 | you will find something really interesting.
00:33:57.680 | So if you apply a constant small learning rate,
00:34:02.560 | like in fine-tuning or alignment, the cosine similarity
00:34:08.640 | between the base model and the upcycled model
00:34:12.440 | would be almost 1.
00:34:14.600 | This is true for most of the aligned models,
00:34:18.160 | for example, Llama Chat versus Llama base.
00:34:23.120 | If you use a high peak learning rate for upcycling,
00:34:27.520 | the cosine similarity would be much lower, around 0.7.
00:34:34.560 | We also analyzed the Mixtral 8x7B base
00:34:42.320 | versus Mistral 7B.
00:34:44.800 | We found that the similarity is also
00:34:47.560 | around there, which means you need higher learning
00:34:52.720 | rates for upcycling.
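(The kind of weight-space comparison described here can be done with a simple cosine similarity over matched parameter tensors; the utility below is a generic sketch, not the paper's exact analysis, and it glosses over how upcycled expert copies are matched back to the dense FFN weights.)

```python
import torch
import torch.nn.functional as F

def weight_cosine_similarity(model_a, model_b):
    """Cosine similarity between same-named parameter tensors of two checkpoints."""
    params_b = dict(model_b.named_parameters())
    sims = {}
    for name, p_a in model_a.named_parameters():
        p_b = params_b[name]
        sims[name] = F.cosine_similarity(p_a.flatten(), p_b.flatten(), dim=0).item()
    return sims
```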
00:34:57.160 | In an additional experiment on the number of experts,
00:35:02.760 | we found 64 experts is kind of like the sweet spot.
00:35:08.160 | If you increase the number of experts beyond 64,
00:35:12.320 | it provides diminishing returns.
00:35:14.760 | Finally, this is the large-scale upcycling.
00:35:24.760 | We upcycled the 8x15B model on 1 trillion tokens.
00:35:30.000 | So there are three models here.
00:35:38.520 | Let me explain.
00:35:39.480 | The base model, 15B, is trained on 8 trillion tokens.
00:35:44.520 | This is pre-training data.
00:35:46.160 | So the validation loss is 1.6, and the MMLU is 59.
00:35:52.320 | The continued training model, its data
00:35:56.360 | targets the academic benchmarks more
00:36:01.960 | to obtain higher performance on the evaluations.
00:36:07.320 | This is roughly 1 trillion tokens.
00:36:09.680 | The upcycling is performed on the same data for comparison.
00:36:14.480 | Continued training is just the dense model trained
00:36:17.480 | further.
00:36:18.720 | You will notice that actually, the data plays the biggest
00:36:23.520 | factor.
00:36:25.200 | Even the base model with continued training
00:36:30.600 | can have a huge boost on MMLU.
00:36:34.960 | And the upcycled model is another 4% to 5% improvement
00:36:40.160 | on top of that.
00:36:41.760 | Continued training is like a 20% improvement because of the data.
00:36:47.200 | This reminds us, again, data is the most important in ML.
00:36:51.880 | If you put the 5% improvement into the scaling law
00:37:03.080 | perspective, we can roughly gauge
00:37:06.360 | what the 5% improvement means.
00:37:10.920 | Actually, the fine-grained MoE, the upcycled one,
00:37:16.280 | has the same FLOPs as the dense model.
00:37:20.600 | And this has about 4% improvement
00:37:24.200 | in terms of the loss.
00:37:26.000 | If you plug a 4% improvement into the scaling law from OpenAI,
00:37:30.600 | it roughly represents a 1.7x bigger model.
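(A back-of-the-envelope version of that conversion, assuming a Kaplan-style power law L ∝ N^(-alpha_N) with alpha_N ≈ 0.076 — the parameter-count exponent from the OpenAI scaling-law paper, used here as an assumption:)

```python
# Assume a power law L ∝ N^(-alpha_N); then L2/L1 = (N2/N1)^(-alpha_N),
# so an x% lower loss corresponds to a (1 - x)^(-1/alpha_N) times larger model.
alpha_n = 0.076                    # Kaplan et al. parameter-count exponent (assumed)
loss_improvement = 0.04            # ~4% lower validation loss from upcycling

equivalent_model_multiplier = (1 - loss_improvement) ** (-1 / alpha_n)
print(f"{equivalent_model_multiplier:.2f}x larger dense model")   # ≈ 1.71x
```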
00:37:36.480 | And the non-fine-grained model, with top-2,
00:37:43.680 | this is the same config as Mixtral 8x7B.
00:37:48.800 | It has increased flops because of top two.
00:37:53.200 | That's roughly 1.7x more flops.
00:37:56.560 | And it's roughly two times as powerful as the original model.
00:38:03.280 | Given that we only spent like 1/8
00:38:05.880 | of the original pre-training compute,
00:38:10.080 | this is indeed some saving compared to training
00:38:14.760 | these MOEs from scratch.
00:38:20.040 | Thank you for listening.
00:38:21.440 | I put the paper link and the Megatron Core MoE GitHub there,
00:38:26.920 | and the NeMo GitHub, where we provide a high-level training interface.
00:38:32.720 | You can also follow me on LinkedIn and Twitter.
00:38:37.560 | I can take questions now.
00:38:40.720 | Hey, nice to see you again.
00:38:43.720 | There are a whole bunch of questions.
00:38:45.200 | I think we'll stop the recording so we
00:38:46.760 | can open up for questions.
00:38:48.280 | But I think everyone is very excited by the presentations.
00:38:52.280 | There's a lot of questions.
00:38:53.400 | And also people want your slides.
00:38:55.120 | So let people know where to get your slides.
00:38:57.840 | Yeah, sure.
00:38:58.440 | I can share the slides.
00:39:00.680 | I can share this slide.