[Paper Club] Upcycling Large Language Models into Mixture of Experts

00:00:00.000 | >> My inter works too.

00:00:04.440 | >> Okay, cool. Today, I'm going to present a mixture of experts.

00:00:10.960 | I'm Ethan from NVIDIA.

00:00:13.160 | I work on scaling, LLMs, transformers.

00:00:18.340 | I might sound a bit muffled because I'm having a cold.

00:00:22.560 | Apologies for that.

00:00:25.000 | To this topic, I'm going to give a brief introduction on mix of experts or MOE first.

00:00:33.740 | I'm going to talk about Megatron Core MOE,

00:00:37.280 | like how we accelerate these MOEs and train and do inference efficiently.

00:00:44.400 | Finally, going to talk about the upcycling LLM into MOE,

00:00:49.640 | which is our recent paper.

00:00:54.440 | So the AI models are growing larger and larger.

00:00:59.600 | This is a rather old picture from 2021.

00:01:03.720 | Switch transformer is the first model that surpassed the one trillion model size.

00:01:09.900 | Before that, it's only hundreds of billion parameters.

00:01:14.520 | I think at that time,

00:01:16.840 | it's growing 10 times each year.

00:01:19.840 | It looks like it's slowing down recently.

00:01:23.760 | The question is, we only have so much compute,

00:01:30.480 | how can we make the model better without increasing the compute?

00:01:35.380 | From Noam, he said,

00:01:38.400 | "My unsubstantiated theory is that parameters are good for knowledge,

00:01:44.640 | and compute or the flop is good for intelligence."

00:01:49.440 | Whatever those term means.

00:01:51.960 | So MOE is a good way of growing

00:01:55.840 | the parameters or growing knowledge without increasing the compute.

00:02:01.680 | So what is MOE?

00:02:05.640 | Here is a very simple diagram from switch transformer.

00:02:11.520 | For the traditional LLMs transformers,

00:02:16.040 | you have the self-attention,

00:02:18.280 | the residual layer norm,

00:02:21.800 | and then you have a FFN layer,

00:02:25.280 | which is just a two-layer layer,

00:02:27.200 | and then the residual.

00:02:29.800 | MOEs transform the FFN layer into multiple copy of them.

00:02:36.680 | You see here, there are four FFN layers,

00:02:41.400 | and then each token selectively activate a few experts,

00:02:46.040 | which is selected by router.

00:02:48.680 | The router is simply a matrix multiplication,

00:02:52.160 | a learnable matrix to select one of the expert based on the input.

00:02:59.360 | The model size increase enhancing its capability,

00:03:03.400 | while the compute roughly remains the same as the original model.

00:03:08.880 | If we look into the MOE layer,

00:03:14.320 | it actually consists of several steps.

00:03:17.920 | It's more complicated than the original FFN layer.

00:03:23.560 | You can think of the original FFN layer as the third step computation,

00:03:30.080 | whereas the expert layer,

00:03:33.840 | which was the original FFN layer,

00:03:36.320 | is applied to the input token.

00:03:40.320 | The first step here is routing.

00:03:44.920 | Given the input token, for example,

00:03:47.680 | here you have like six tokens,

00:03:50.840 | the quick brown fox jumped over.

00:03:55.240 | It went over the router.

00:03:58.120 | Router is simply a matrix that is applied on these tokens.

00:04:02.720 | It generates the probabilities,

00:04:05.280 | and we will take the highest probability as the router selected.

00:04:13.680 | Here is the expert indices,

00:04:16.400 | are the experts which have the highest probability on these tokens.

00:04:23.240 | The second step is permutation.

00:04:26.560 | Given the token selected experts,

00:04:31.560 | you need to align these input features with the experts.

00:04:37.200 | All of the tokens that select expert zero,

00:04:42.120 | you need to arrange them into a single matrix

00:04:47.600 | that are only for experts zero,

00:04:51.800 | expert one, expert two.

00:04:54.840 | Depending on the capacity factor,

00:04:57.760 | some of the token might be dropped.

00:05:01.000 | The third step is the same as the original FFN layer,

00:05:04.720 | where you do the computation.

00:05:07.640 | After the computation,

00:05:09.440 | there will be an on-premier step

00:05:14.080 | where you need to arrange these tokens back to the original shape.

00:05:19.040 | Then the router probability is applied as a scaling factor

00:05:25.000 | on all of these expert features.

00:05:32.520 | Scaling this MOE during training is very challenging.

00:05:42.000 | First, the models are massive scales.

00:05:46.080 | Usually, for example, mixed row 8x7b,

00:05:49.400 | you increase the model parameters roughly by 6x or 7x.

00:05:55.880 | This puts substantial pressure on the memory usage.

00:06:01.040 | And the router dispatching also has overhead.

00:06:04.920 | In the previous slide,

00:06:07.560 | you will notice there's a permute on on-permute operator.

00:06:11.680 | That essentially increases the activation memory by two times

00:06:18.800 | because all of those need to be stored.

00:06:21.200 | If you have top-k routing,

00:06:23.840 | it would also increase the memory of the activation by k

00:06:28.200 | because the hidden states need to go to each expert

00:06:33.400 | and you need to duplicate it.

00:06:37.360 | And also reduce the jam efficiency

00:06:39.800 | because you need to do a loop over all the experts

00:06:42.760 | to do the jam separately.

00:06:46.480 | There's also an imbalance issue if all of the tokens

00:06:50.440 | go to one expert,

00:06:51.840 | other experts on other GPU would be idle.

00:06:58.160 | Now, let me introduce the Megatron Core MOE,

00:07:04.760 | which is how we accelerate these MOE models

00:07:08.480 | given these challenges.

00:07:13.240 | So Megatron LLAM and Megatron Core

00:07:16.520 | is an open-source library on GitHub available.

00:07:21.360 | We accelerate not only MOE and also all of the LLAMs,

00:07:28.240 | including like GPT, BERT, T5.

00:07:31.880 | I'm not sure if anyone is still using those now,

00:07:35.080 | but primarily GPT models.

00:07:37.360 | And inside the transformer layers,

00:07:41.440 | the attention is accelerated with all kinds of parallelism,

00:07:46.680 | including pipeline parallel, tensor parallel,

00:07:50.760 | also the parallelism and the MOEs are also accelerated.

00:07:55.920 | This is what we are primarily talking about today,

00:07:58.560 | Megatron Core MOE.

00:08:01.800 | On top, you have two customizable training loops.

00:08:06.640 | Megatron LLAM provides simple bare-bone training loop.

00:08:10.520 | And you can easily hack.

00:08:13.200 | And NEMO provide a high-level interface

00:08:17.080 | where you can just provide Pythonic configuration

00:08:20.520 | to train these models.

00:08:21.560 | In the Megatron Core MOE, we provide different approaches

00:08:29.960 | to accelerate these models.

00:08:32.240 | For the router, there are Oxloss and Sinkhorn.

00:08:36.760 | Basically, Oxloss is a token trace MOE.

00:08:42.160 | And Sinkhorn, without token dropping,

00:08:45.600 | it can usually use in expert trace.

00:08:50.120 | And the tokens dispatcher, they're

00:08:53.760 | in permute, unpermute for efficient memory saving.

00:08:57.760 | And for the expert, we have grouped MLP to accelerate this.

00:09:05.720 | So first, it's expert model parallel.

00:09:10.360 | This is available in Megatron Core MOE now.

00:09:18.240 | You can-- usually, you would put all of the experts

00:09:22.720 | on one single GPUs.

00:09:24.480 | And then you do a for loop over all of the experts

00:09:27.840 | to compute the result. But instead, we

00:09:30.920 | can put one expert on each GPU.

00:09:35.360 | This will release a lot of memory

00:09:37.840 | and also accelerate the training on the inference.

00:09:42.080 | So token dropping, the default we use

00:09:49.880 | is dropless, meaning all of the tokens

00:09:54.160 | can go to one expert and no tokens are dropped.

00:10:00.280 | We also support token dropping with padding,

00:10:05.960 | meaning give a set capacity factor, for example, four here.

00:10:11.560 | Each expert can, at max, accept four tokens.

00:10:15.960 | Tokens beyond that are going to be dropped.

00:10:20.040 | So accuracy-wise and efficiency-wise,

00:10:24.680 | there are a lot of discussion around here.

00:10:29.400 | A lot of the pre-training experiments

00:10:32.880 | shows that token dropping is very efficient

00:10:35.760 | and it doesn't impact performance.

00:10:38.480 | But in some of the downstream fine-tuning,

00:10:42.000 | people realize dropless is better.

00:10:44.000 | Maybe it's because of the domain shifts.

00:10:48.920 | The tokens are no longer balanced.

00:10:51.520 | So we provide both of the options.

00:10:57.160 | So recently, there are a lot of new MOEs

00:11:02.120 | that have increasing number of experts.

00:11:05.120 | This will cause a very huge overhead.

00:11:09.280 | For example, the DeepSeq v2 MOE, it has 160 experts

00:11:14.520 | and eight of them are active.

00:11:17.720 | If you think about the memory overhead,

00:11:22.760 | let's say first you would have the tokens.

00:11:27.160 | And these tokens need to go to different experts.

00:11:30.560 | And at this step, you would have a scatter and a copy operator.

00:11:35.520 | Depending on the number of the top k,

00:11:38.200 | say the top k is eight here, each of the hidden states

00:11:42.800 | would be copied eight times, which is a very huge overhead.

00:11:47.920 | And after the expert operation is done,

00:11:53.400 | there's another eight copy of it here.

00:11:56.280 | So we have a fused permutation operation

00:12:03.880 | available in Maxwell for MOE, where all of these operations

00:12:09.000 | are fused.

00:12:10.320 | Because the copy operation here, it has zero compute,

00:12:17.640 | but it could cause duplicated memory.

00:12:21.240 | If you have these fused operations,

00:12:24.880 | you can easily compute these features during backward pass

00:12:30.520 | while saving a lot of memory.

00:12:32.040 | Let's also look at the implementation of Maxwell 8x7

00:12:42.600 | on Hagen-Fitts transformer.

00:12:44.880 | You also notice in the expert operation there,

00:12:50.560 | you would iterate over all of the experts

00:12:53.240 | and compute each of the gem operation one by one.

00:12:59.400 | We found that this is very inefficient.

00:13:03.240 | Instead, we provide an interface to Catalyst group gem,

00:13:10.880 | where the Catalyst group gem groups all of the looping

00:13:16.480 | over experts and calculate gem into a single operation.

00:13:23.520 | Given any number of experts and any number of tokens,

00:13:29.920 | you can efficiently compute the output in one operation.

00:13:34.920 | And this is very efficient.

00:13:36.960 | So that's pretty much all of the optimizations for MOE

00:13:51.000 | in Matron-Core MOE.

00:13:52.640 | If any one of you have questions,

00:13:55.560 | I can first answer those questions

00:13:58.240 | and then go to MOE upcycling.

00:14:00.680 | I'm just curious, are these available for everyone

00:14:13.480 | to use in a way that attracts your attention?

00:14:16.640 | Or are these specific to the Megatron implementation?

00:14:22.880 | So you can import them as a standalone module

00:14:33.200 | and apply them to any of your--

00:14:37.680 | Core library, right?

00:14:40.200 | Yeah, you can import Megatron Core as a library

00:14:44.040 | and just use it in your network.

00:14:46.160 | But I think there are some caveats.

00:14:49.160 | So for example, if you want to use expert parallelism,

00:14:53.720 | you will need to also use Megatron's parallelism strategy

00:15:01.760 | to initialize a strategy first.

00:15:04.800 | But if you're using, for example, the group gem,

00:15:08.360 | I think you can just get away with any kind of network.

00:15:12.720 | You can combine it with Huggins phase transformer

00:15:15.800 | with any problem.

00:15:18.080 | This is just a PyTorch layer with a fused operator.

00:15:23.880 | You can just import it as a library.

00:15:26.320 | It's very standalone.

00:15:27.720 | Cool.

00:15:31.720 | Do you have any intuition as far as the knowledge

00:15:46.480 | contained in each expert?

00:15:49.720 | I guess previous to seeing this paper,

00:15:52.000 | I had assumed that maybe one expert was good at economics,

00:15:56.680 | another one was good at physics, and so forth.

00:15:59.120 | But this makes it seem like it's more token by token.

00:16:05.920 | So do you have any intuition around that?

00:16:09.240 | Yeah, so we haven't started in our research,

00:16:12.680 | but I saw a lot of research on interpretability

00:16:16.120 | of the MOE models.

00:16:19.280 | So unfortunately, people didn't find

00:16:24.320 | a significant interpretability inside these experts.

00:16:31.120 | One expert focused on math, the other focused on literature.

00:16:36.320 | I think the problem is that neural network hidden states

00:16:41.560 | are already very entangled.

00:16:44.560 | So one hidden states are in this kind of superposition

00:16:49.440 | where it can represent multiple different features.

00:16:54.960 | So it's very hard to tell which expert focus on which area.

00:17:00.080 | There are some evidence of specializations

00:17:04.720 | on early layers of the experts, for example,

00:17:08.440 | first and second layers.

00:17:10.160 | You'll find some expert focus on multiple tokens,

00:17:14.520 | some expert focus on single tokens, things like that.

00:17:20.760 | And there's also one pretty interesting research

00:17:23.520 | from Facebook.

00:17:26.440 | They do a specialized training of dense model first,

00:17:32.040 | then combine those dense experts specialized

00:17:36.760 | into different domain into a MOE.

00:17:40.520 | In that case, it still preserves some of the specializations.

00:17:46.600 | Thank you.

00:17:47.120 | Just a quick question.

00:17:52.800 | We talked about top T sampling to select the expert

00:17:56.400 | throughout.

00:17:57.400 | Is there any benefit to using maybe a top T sampling

00:18:00.200 | approach similar to how you would use top T sampling

00:18:03.960 | for selecting an expert as compared to top K?

00:18:09.000 | Top P?

00:18:10.360 | What do you mean top P?

00:18:12.520 | Considering a list of experts until they

00:18:16.080 | exceed a given probability cumulatively?

00:18:20.720 | Does that make sense?

00:18:22.120 | I see, yeah.

00:18:22.920 | So since the top--

00:18:27.040 | it will be dynamic, let's say.

00:18:30.160 | Sometimes it can select some more.

00:18:33.360 | Sometimes it selects less.

00:18:35.400 | I think that promotes a little more diversity, I've heard.

00:18:39.520 | Yeah.

00:18:41.400 | Yeah, I think that makes sense.

00:18:44.080 | That will create some difficulty in optimization.

00:18:49.320 | But I think another thing pretty exciting is the expert choice.

00:18:54.520 | You see, in expert choice model, the selection is reversed.

00:19:00.480 | So here, we talk about all of them.

00:19:03.000 | Usually, this is token choice, which means

00:19:08.560 | the token selects K experts.

00:19:11.680 | So each token always have, for example,

00:19:14.400 | two experts applied on it.

00:19:17.320 | But expert choice is another way where the experts select tokens.

00:19:23.800 | Each expert only selects K tokens constantly.

00:19:28.720 | So even though this is fixed, but from the token

00:19:32.960 | perspective, each token can have zero expert applied to it,

00:19:39.000 | or more than zero, or all of the experts applied to it.

00:19:42.720 | Yeah.

00:19:43.440 | In that case, you're either overloading the expert

00:19:45.520 | or not considering all the tokens.

00:19:47.280 | It's like a trade-off.

00:19:49.840 | Yeah, in fact, expert choice applies pretty well

00:19:53.600 | to vision models, because vision models do not have causal mask.

00:19:58.400 | OK.

00:20:01.080 | Thank you so much.

00:20:02.160 | OK.

00:20:09.040 | I guess I can go to the next section, upcycling MOEs.

00:20:15.400 | So if you are going to remember one thing, remember this.

00:20:26.240 | So you can upcycle your dense models

00:20:29.160 | into a mix of experts.

00:20:31.920 | By training these upcycled models,

00:20:34.200 | you can achieve better accuracy than simply training

00:20:37.560 | the dense model further for the same number of flops.

00:20:42.240 | The context is we have so many big dense models.

00:20:46.280 | For example, there's a Lama 405B.

00:20:49.840 | And from NVIDIA, we have Nemotron 340B.

00:20:53.680 | These models are huge.

00:20:56.280 | It's very expensive to retrain MOE variant of it.

00:21:00.840 | If we want to further improve these models,

00:21:05.320 | you can upcycle these models into MOE

00:21:09.680 | to achieve better accuracy.

00:21:12.240 | On other scaling experiments, we tried on 15B models upcycling

00:21:20.040 | and applied on 1 trillion tokens and achieved roughly about 5%

00:21:25.800 | improvement in terms of the validation loss

00:21:29.080 | and 4% improvement on MMLU.

00:21:33.320 | It's exciting because the original sparse upcycling

00:21:38.280 | paper found it difficult to scale

00:21:42.000 | beyond 1 billion parameters.

00:21:45.480 | We found there are several key factors to go beyond 1 billion.

00:21:55.400 | So this is a concept of MOE.

00:21:59.840 | Let's say you have the MLP in the original plane boat

00:22:05.320 | transformer.

00:22:07.040 | And here, this is a mixture of two experts.

00:22:11.360 | One of the experts is activated, so the flop

00:22:15.160 | is the same as the original model.

00:22:17.400 | Now the parameters is increased.

00:22:21.400 | To upcycle such a model, you do two things.

00:22:26.640 | First is to copy the MLP layer into a number of expert copies.

00:22:35.920 | Here is just two copies.

00:22:38.200 | And then you randomly initialize the router rates.

00:22:41.800 | Then you just train this model.

00:22:43.640 | Pretty straightforward, right?

00:22:47.320 | There is one caveat here.

00:22:50.960 | This model needs to perform the same as the original model

00:22:56.000 | in the first forward pass.

00:22:58.440 | Otherwise, it will lead to catastrophe forgetting.

00:23:01.800 | So the trick to maintain this feature

00:23:12.240 | is through the swapping of the top-k softmax operator.

00:23:17.680 | This is introduced in Mixture 8x7b.

00:23:21.840 | Let's say you have the MLP copied,

00:23:24.960 | and then you do the top-k first to select two experts.

00:23:35.000 | And then you do the softmax on top

00:23:38.200 | of the logits from the top-k router.

00:23:43.640 | In this way, because the softmax is

00:23:46.160 | applied to the top-k output, it always sum up to 1.

00:23:51.960 | Here I give example 0.7, 0.3.

00:23:55.560 | And then you add these two outputs together.

00:24:01.240 | In this way, because the MLP layer

00:24:03.840 | is exactly the same as the dense model,

00:24:06.520 | these two copies are just the same.

00:24:09.160 | So the model output is the same as the original dense model.

00:24:13.800 | This is a very important feature in upcycling,

00:24:18.280 | because the upcycled MLE model actually

00:24:22.760 | behaves exactly the same as dense without any training.

00:24:27.400 | And this will help stabilize the model

00:24:30.160 | and avoid catastrophe forgetting.

00:24:34.120 | But the problem here is we actually

00:24:38.120 | found the mixed-source approach didn't work as well as

00:24:41.120 | expected, because the original switch transformer from Google

00:24:48.440 | uses softmax and top-k for a reason.

00:24:52.040 | And because of upcycling, if you switch to top-k, then softmax,

00:24:57.080 | it actually hurts some performance.

00:25:01.480 | The difference is simply the swap of top-k

00:25:06.560 | and the softmax order.

00:25:09.960 | I've already explained how mixed-source did top-k then

00:25:14.800 | softmax.

00:25:16.160 | So in the original switch transformer paper,

00:25:19.240 | you apply softmax first.

00:25:21.680 | So the probability of all of the experts would sum up to 1.

00:25:28.360 | So if you apply top-k, the output no longer sum up to 1.

00:25:34.120 | Here, it's the example 0.4, 0.2.

00:25:38.080 | So this is smaller than the original output.

00:25:40.920 | If you just train this model naively on a large scale,

00:25:46.720 | the model would catastrophically forget.

00:25:49.800 | But on a smaller scale model like 1B or under,

00:25:53.680 | you can kind of get away with this problem,

00:25:56.040 | because small model adapts very fast.

00:25:58.640 | This is probably one of the reasons

00:26:00.560 | the original sparse upcycling paper didn't go

00:26:03.960 | beyond 1 billion parameters.

00:26:06.720 | So we found a very simple approach to solve this problem.

00:26:16.360 | Because the output scale is smaller

00:26:18.520 | than the original model, we can simply

00:26:22.640 | scale up the MLP output by the number of experts

00:26:28.880 | divided by top-k.

00:26:30.680 | For example, if it's mixed-row 8 by 7,

00:26:34.400 | we simply scale the MLP layer output by 4x.

00:26:39.360 | This will solve the problem of the scale.

00:26:42.480 | And the model still behaves the same as the original model.

00:26:45.720 | You can train the upcycle model normally.

00:26:51.120 | And then we found with this approach,

00:26:54.360 | it consistently outperforms the mixed-row approach.

00:27:00.280 | So we can get the benefit of the original three-transformer.

00:27:07.000 | A bit of intuition behind why softmax and top-k is better.

00:27:13.680 | Because if you apply softmax to all of the experts,

00:27:19.200 | the probability distribution is always

00:27:23.600 | measured on all of the experts.

00:27:26.080 | However, in the swap case, the probability distribution

00:27:31.880 | is only on two experts.

00:27:33.840 | And it's dynamic.

00:27:35.120 | It's harder for the model to learn.

00:27:37.880 | Additionally, if you only use top-1,

00:27:41.800 | this method will not work.

00:27:43.640 | Because if the softmax on one expert is always 1,

00:27:49.240 | there wouldn't be any gradient to learn.

00:27:52.520 | So next, let's go to fine-grained MLE.

00:28:05.720 | This is very popular in the most recent MLEs.

00:28:09.720 | For example, the Quain V2 uses 64 experts.

00:28:15.400 | And DeepSeq V2 uses 120, 128, or 160-something experts.

00:28:23.360 | Granularity uses more experts, but smaller ones.

00:28:27.000 | So this gives the flexibility of more combinations

00:28:32.000 | of different experts, so more representation power.

00:28:37.640 | For example, here, originally you

00:28:40.840 | have the BigSurf2 expert, and 1 is selected.

00:28:44.800 | Instead, you can expand the number of experts into 4.

00:28:52.120 | Each expert is smaller than before.

00:28:56.200 | So the compute is the same.

00:28:58.880 | And the parameters is also the same.

00:29:02.440 | Here, some notation here.

00:29:05.480 | E2 means expansion factor is 2.

00:29:09.680 | This is how many times the expert is copied.

00:29:15.320 | And G2 is granularity, or how many times

00:29:21.840 | the expert is segmented.

00:29:24.720 | So you can think of this like the expert is first

00:29:29.880 | copied into two copies here.

00:29:32.720 | And then each copy is segmented two times.

00:29:37.400 | T2 means how many experts we route to.

00:29:45.320 | Here, you route to two experts.

00:29:48.120 | So the flop is the same as the original one.

00:29:55.680 | So you would soon notice a problem

00:29:59.680 | if you upcycle such model.

00:30:02.600 | Let's say these two segments are not the same anymore,

00:30:10.000 | because you segment one expert into two shards.

00:30:18.440 | For the top two cases, if you select one shard two times

00:30:22.960 | and the other shard zero time, the output

00:30:26.160 | is no longer the same as the original MOE.

00:30:30.000 | And we mentioned that it's very important for upcycling

00:30:34.800 | to maintain the same forward pass

00:30:38.800 | as the original dense model.

00:30:42.280 | The solution here is also rather straightforward.

00:30:46.760 | So instead of randomly initialize a router,

00:30:50.520 | we would initialize half as a router, and then duplicate it.

00:30:56.200 | So this would ensure that the probability distribution would

00:31:01.080 | be the same in each virtual group, or in each shard group.

00:31:07.760 | Because if these are the same, the top case selection

00:31:11.040 | will be exactly the same, the highest score

00:31:17.240 | would be the same for these two groups.

00:31:20.640 | I say here the example is that I shard the MLP into two parts.

00:31:29.280 | The first, I need to select the orange exacting ones

00:31:34.040 | and the blue exacting ones.

00:31:36.800 | By duplicating the router base, this achieves the purpose.

00:31:44.880 | And the formula for scaling the weights are also the same,

00:31:51.480 | except there's an additional granularity factor.

00:31:55.480 | This is a bit complicated, so I'll skip this part.

00:32:00.480 | So for our experiment, we do our experiment

00:32:10.280 | on the 8 trillion tokens Nemotron trained on.

00:32:14.120 | The ablation study is on a smaller model, Nemotron2B.

00:32:18.840 | And the bigger model, we do it on 8 by 15B on 1 trillion

00:32:23.360 | tokens.

00:32:23.860 | And we found that the learning rate

00:32:29.440 | is the most important hyperparameter, as always

00:32:32.480 | in machine learning.

00:32:34.360 | And this is true also for upcycling.

00:32:37.120 | Learning rate is the most important parameter.

00:32:41.560 | So in the original sparse upcycling paper,

00:32:46.440 | the learning rate is taken from the minimum--

00:32:51.440 | the ending learning rate from pre-training.

00:32:54.080 | So sometimes it can be rather small.

00:32:57.760 | So we found that if you have a high learning rate,

00:33:04.120 | it could help the upcycling a lot.

00:33:07.720 | Here is the orange line is the lowest learning rate.

00:33:11.600 | Basically, you continue to fine-tune the model into MOE.

00:33:18.440 | This learning rate is typical, for example,

00:33:21.520 | for other tasks like alignment.

00:33:24.360 | But for upcycling, the model needs

00:33:28.240 | to adapt to a new local minima.

00:33:31.520 | We need a larger learning rate for this.

00:33:34.400 | We found that the best is to use the original highest peak

00:33:39.800 | learning rate from pre-training, which works the best.

00:33:43.120 | And if you dive into the base of this upcycled model,

00:33:53.600 | you will find something really interesting.

00:33:57.680 | So if you apply a constant small learning rate,

00:34:02.560 | like a fine-tuning or alignment, the constant similarity

00:34:08.640 | between the base model and the upcycled model

00:34:12.440 | would be almost 1.

00:34:14.600 | This is true for most of the aligned model,

00:34:18.160 | for example, LamaChat versus LamaBase.

00:34:23.120 | If you use a high peak learning rate for upcycling,

00:34:27.520 | the constant similarity would be much lower, around 0.7.

00:34:34.560 | Also, we also analyzed the mixed-row 8x7 base

00:34:42.320 | versus mixed-row 7B.

00:34:44.800 | We found that the similarity is also

00:34:47.560 | around there, which means you need higher learning

00:34:52.720 | rates for upcycling.

00:34:57.160 | An additional experiment on number of experts,

00:35:02.760 | we found 64 experts is kind of like the sweet spot.

00:35:08.160 | If you increase the number of experts beyond 64,

00:35:12.320 | it provides diminishing return.

00:35:14.760 | Finally, this is the large-scale upcycling.

00:35:24.760 | We upcycled the 8x15B model on 1 trillion tokens.

00:35:30.000 | So there are three models here.

00:35:38.520 | Let me explain.

00:35:39.480 | The base model, 15B, is trained on 8 trillion tokens.

00:35:44.520 | This is pre-training data.

00:35:46.160 | So validation loss, 1.6, and the MMLU, 59.

00:35:52.320 | So the continual training model, it

00:35:56.360 | target more on the academic benchmarks

00:36:01.960 | to obtain higher performance on the evaluations.

00:36:07.320 | This is roughly 1 trillion tokens.

00:36:09.680 | The upcycling is performed on the same data for comparison.

00:36:14.480 | Continuous training is just a dense model of continuous

00:36:17.480 | training.

00:36:18.720 | You will notice that actually, the data plays the biggest

00:36:23.520 | factor.

00:36:25.200 | Even the base model continuous training,

00:36:30.600 | it can have a huge boost on MMLU.

00:36:34.960 | And the upcycled model is another 4% to 5% improvement

00:36:40.160 | on top of that.

00:36:41.760 | Continuous training is like 20% improvement because of data.

00:36:47.200 | This reminds us, again, data is the most important in ML.

00:36:51.880 | If you put the 5% improvement into the scaling law

00:37:03.080 | perspective, we can roughly gauge

00:37:06.360 | how much the 5% improvement means.

00:37:10.920 | Actually, the fine-grained MOE, the upcycled,

00:37:16.280 | this has the same plot as the dense model.

00:37:20.600 | And this has about 4% improvement

00:37:24.200 | in terms of the loss.

00:37:26.000 | 4% improvement if you plug in the scaling law from point AI,

00:37:30.600 | it's roughly represent 1.7x bigger model.

00:37:36.480 | And the non-fine-grained model, well, top two,

00:37:43.680 | this is the same config as the 8x7 mixed row.

00:37:48.800 | It has increased flops because of top two.

00:37:53.200 | That's roughly 1.7x more flops.

00:37:56.560 | And it's roughly two times as powerful as the original model.

00:38:03.280 | Given that we only spent like 1/8

00:38:05.880 | of the original pre-training compute,

00:38:10.080 | this is indeed some saving compared to training

00:38:14.760 | these MOEs from scratch.

00:38:16.280 | OK.

00:38:20.040 | Thank you for listening.

00:38:21.440 | I put the paper links, Microsoft Core MOE GitHub there

00:38:26.920 | and Nevo GitHub where we provide high-level training interface.

00:38:32.720 | You can also follow me on LinkedIn and Twitter.

00:38:37.560 | I can take questions now.

00:38:40.720 | Hey, nice to see you again.

00:38:43.720 | There are a whole bunch of questions.

00:38:45.200 | I think we'll stop the recording so we

00:38:46.760 | can open up for questions.

00:38:48.280 | But I think everyone is very excited by the presentations.

00:38:52.280 | There's a lot of questions.

00:38:53.400 | And also people want your slides.

00:38:55.120 | So let people know where to get their slides.

00:38:57.840 | Yeah, sure.

00:38:58.440 | I can share the slides.

00:39:00.680 | I can share this slide.