MoE Travelogue: 9. The Debate over Gate Normalization · English (unofficial) translations of posts at kexue.fm

Looking back at the history of MoE, in earlier years the Router, when used as a Gate multiplied onto the Experts, almost always used Softmax activation, and this remains one of the standard forms of MoE today. However, to work with Loss-Free load balancing, DeepSeek changed the activation function to Sigmoid and showed that this is also a competitive scheme, which prompted deeper thinking about, and more experiments on, the form of the Router.

Even within the discussion of Softmax, there are two slightly different approaches: first Softmax then select Top-k, or first select Top-k then Softmax? The latter can also be understood as applying a normalization after selecting Top-k, i.e., Re-Norm. So, whether the Gate’s activation function needs normalization, and if so, whether to select Top-k after normalization or to use Re-Norm, is the topic of this article.

Problem Description

We know that the general form of MoE is \boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho}} \rho_i \boldsymbol{e}_i \label{eq:moe-1} Here \boldsymbol{\rho} actually plays two roles: when it is used to select the Top-k Experts, its role is the Router; when it is multiplied onto the Experts, its role is the Gate. From the design of MoE, the core role of \boldsymbol{\rho} is clearly the Router, and the Gate’s function is to provide gradients for it during training.

The problem we want to discuss can also be understood as how to construct \boldsymbol{\rho}=(\rho_1, \rho_2, \cdots, \rho_n) in a more principled way so that the Router obtains better gradients. For a long time, the standard answer has been Softmax, i.e., \rho_i = \frac{e^{s_i}}{\sum_{j=1}^n e^{s_j}} where \boldsymbol{s}=(s_1, s_2, \cdots, s_n) are the logits directly projected from a linear layer. However, although this answer is “standard”, the author has not found any explanation for it; it seems everyone simply accepted it and kept using it, which for a time left the author quite confused about the training mechanism of MoE.

Other Choices

As mentioned at the beginning, DeepSeek tried Sigmoid activation in Loss-Free load balancing and later used it in DeepSeek-V3. Its success shows that non-Softmax activation can also achieve good results. This inspired everyone to try more general approaches. For example, ReMoE uses ReLU activation, and our geometric perspective in “MoE Travelogue: 1. Starting from Geometric Meaning” allows any non-negative activation function.

Furthermore, in the form of MoE, there is also the Re-Norm option, i.e., changing [eq:moe-1] to \boldsymbol{y} = \frac{\sum\limits_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho}} \rho_i \boldsymbol{e}_i}{\sum\limits_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho}} \rho_i} \label{eq:moe-2} That is, re-normalize the selected Top-k \rho_i. For Softmax, this is equivalent to using \boldsymbol{s} to select the Top-k, setting the unselected ones to -\infty, and then applying Softmax. The advantage of Re-Norm is that it makes the forward computation numerically more stable. Note, however, that when using Re-Norm, k must be at least 2: with k=1 the gate becomes \rho_i/\rho_i = 1, a constant, so \boldsymbol{\rho} receives no gradient at all and cannot be trained.
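The two Softmax orderings and the k=1 degeneracy of Re-Norm can be checked on a small example. The following is only a sketch using the Python standard library; the function names and the logits are made up for illustration and are not taken from any particular MoE implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(values, k):
    """Indices of the k largest entries, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


def gates_softmax_then_topk(logits, k, renorm=False):
    """Softmax over all experts, then keep the Top-k (optionally Re-Norm)."""
    rho = softmax(logits)
    gates = {i: rho[i] for i in top_k_indices(rho, k)}
    if renorm:
        total = sum(gates.values())
        gates = {i: g / total for i, g in gates.items()}
    return gates


def gates_topk_then_softmax(logits, k):
    """Select the Top-k logits first, then Softmax over the selected ones only."""
    idx = top_k_indices(logits, k)
    return dict(zip(idx, softmax([logits[i] for i in idx])))


logits = [1.2, -0.3, 0.7, 2.1, 0.0]  # hypothetical router logits s
plain = gates_softmax_then_topk(logits, k=2)                  # gates sum to less than 1
renormed = gates_softmax_then_topk(logits, k=2, renorm=True)  # gates sum to 1
masked = gates_topk_then_softmax(logits, k=2)                 # matches `renormed`
k1 = gates_softmax_then_topk(logits, k=1, renorm=True)        # single gate equal to 1
print(plain, renormed, masked, k1)
```

With k=1 and Re-Norm the only gate is identically 1 whatever the logits are, which is why the Router gets no gradient through the gate in that setting.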

Judging from current practice across different teams, these MoE variants perform roughly the same, and none is significantly superior. Since practice cannot determine a winner, let us explore theoretically which form is better justified.

Design Principles

Our goal is to find a first principle that is closer to the essence and use it to derive the gating mechanism of the current MoE.

So, the first question is naturally what this “principle” is. For simplicity, let’s first consider k=1. We know that the most important feature of MoE is sparsity: a Router first determines which Experts to activate, and then only those Experts are computed, thereby increasing the number of parameters while controlling the computation. If it is just this idea, then the naive model should be \boldsymbol{f}\left(\boldsymbol{e}_{\mathop{\text{argmax}}\boldsymbol{\rho}}\right) i.e., pick the one with the highest score from the Router \boldsymbol{\rho} and activate the corresponding Expert. This form is perfectly fine for inference, but during training, the Router will receive no gradient and thus cannot be updated. So we need to design gradients for the Router. How to design gradients for the Router? To answer this question, we must first think clearly: what kind of Router do we need?

Since only one Expert can be activated, we naturally hope that this Expert is the one with the best performance. If we denote the loss function by \ell, then our expectation can be written as \mathop{\text{argmax}}\boldsymbol{\rho} = \mathop{\text{argmin}}\, [\ell(\boldsymbol{e}_1),\ell(\boldsymbol{e}_2),\cdots,\ell(\boldsymbol{e}_n)]\label{eq:target} This is the design principle we are looking for.

Objective Transformation

However, the objective [eq:target] is not yet a loss function that can be directly used for training; it requires further transformation. To this end, we construct two distributions. The first is the target distribution \boldsymbol{q}=(q_1,q_2,\cdots,q_n) built based on the loss function, defined as q_i = \frac{e^{-\ell(\boldsymbol{e}_i)/\tau}}{\sum_{j=1}^n e^{-\ell(\boldsymbol{e}_j)/\tau}} This distribution is independent of the Router; for the Router’s learning, it is the “target distribution.” The second distribution is the predicted distribution \boldsymbol{p} constructed from \boldsymbol{\rho}. Here there are many possibilities; for example, \boldsymbol{\rho} itself can be the distribution \boldsymbol{p} (if \boldsymbol{\rho} is already normalized), or \boldsymbol{p} can be the Softmax of \boldsymbol{\rho} (in which case \boldsymbol{\rho} are logits), or normalization methods other than Softmax. In short, \boldsymbol{p} is some probability representation of the Router, and let its parameters be \boldsymbol{\theta}.

We convert the objective [eq:target] into bringing \boldsymbol{p} closer to \boldsymbol{q}, thereby providing gradients for \boldsymbol{\theta}. To do this, we consider minimizing the KL divergence KL(\boldsymbol{p}\Vert \boldsymbol{q}) = \sum_{i=1}^n p_i \log \frac{p_i}{q_i} Substituting \log q_i = -\ell(\boldsymbol{e}_i)/\tau - \log \sum_{j=1}^n e^{-\ell(\boldsymbol{e}_j)/\tau} and rearranging yields KL(\boldsymbol{p}\Vert \boldsymbol{q}) = - \mathcal{H}(\boldsymbol{p}) + \frac{1}{\tau}\sum_{i=1}^n p_i \ell(\boldsymbol{e}_i) + \log \sum_{i=1}^n e^{-\ell(\boldsymbol{e}_i)/\tau} This objective consists of three terms. The first term is the negative entropy -\mathcal{H}(\boldsymbol{p}); minimizing it means maximizing entropy, which encourages the model to explore fully. We can consider that load balancing already plays a similar role, so we ignore it for now. The third term is independent of \boldsymbol{p}, i.e., independent of \boldsymbol{\theta}. Dropping it, together with the positive constant factor 1/\tau, the equivalent loss function is \mathcal{L} = \sum_{i=1}^n p_i \ell(\boldsymbol{e}_i).
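The three-term decomposition of the KL divergence is easy to verify numerically. The sketch below uses made-up per-expert losses, router logits, and temperature (all hypothetical) and compares the directly computed KL with the sum of the negative entropy, the expected loss divided by \tau, and the log-partition constant.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tau = 0.5                         # hypothetical temperature
losses = [2.3, 1.1, 1.7, 0.9]     # hypothetical per-expert losses l(e_i)
q = softmax([-l / tau for l in losses])   # target distribution q
p = softmax([0.2, -0.4, 1.0, 0.3])        # some predicted distribution p

# KL(p || q) computed directly from its definition.
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The three pieces of the decomposition.
neg_entropy = sum(pi * math.log(pi) for pi in p)                  # -H(p)
expected_loss = sum(pi * li for pi, li in zip(p, losses)) / tau   # (1/tau) sum_i p_i l(e_i)
# Enters with a plus sign, because log q_i = -l(e_i)/tau - log Z.
log_partition = math.log(sum(math.exp(-l / tau) for l in losses))

decomposed = neg_entropy + expected_loss + log_partition
print(kl, decomposed)
```

Only `neg_entropy` and `expected_loss` depend on `p`, so these are the only terms that can pass gradients to the Router parameters.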

Straight-Through Estimator

Taking the gradient of the equivalent loss, we get \nabla_{\boldsymbol{\theta}}\mathcal{L} = \sum_{i=1}^n \nabla_{\boldsymbol{\theta}} p_i \cdot \ell(\boldsymbol{e}_i) = \sum_{i=1}^n p_i \nabla_{\boldsymbol{\theta}} \log p_i \cdot \ell(\boldsymbol{e}_i) = \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}}\log p_i \cdot \ell(\boldsymbol{e}_i)] The key here is to use \nabla_{\boldsymbol{\theta}} p_i = p_i \nabla_{\boldsymbol{\theta}} \log p_i to separate out a factor p_i, so that the sum can be transformed into an expectation, and then through sampling to achieve the sparse computation goal of MoE. Some readers may have recognized that this is exactly REINFORCE in policy gradient! (Refer to “From Sampling to Optimization: A Unified View of Differentiable and Non-Differentiable Optimization” and “Policy Gradient and Zero-Order Optimization: Different Paths to the Same Goal”).
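As a sanity check of the score-function identity, the sketch below assumes the simplest possible Router, \boldsymbol{p} = \text{Softmax}(\boldsymbol{\theta}) with the logits themselves as parameters, for which \partial \log p_i / \partial \theta_j = \delta_{ij} - p_j. It compares a finite-difference gradient of \mathcal{L} with a Monte Carlo estimate of the expectation. The losses, logits, and sample count are arbitrary illustrative choices.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_loss(theta, losses):
    """L(theta) = sum_i p_i * l_i with p = softmax(theta)."""
    return sum(pi * li for pi, li in zip(softmax(theta), losses))


theta = [0.2, -0.4, 1.0, 0.3]   # hypothetical router logits (the parameters)
losses = [2.3, 1.1, 1.7, 0.9]   # hypothetical per-expert losses, held fixed
n = len(theta)
p = softmax(theta)

# Reference gradient by central finite differences.
eps = 1e-6
fd_grad = []
for j in range(n):
    up, down = list(theta), list(theta)
    up[j] += eps
    down[j] -= eps
    fd_grad.append((expected_loss(up, losses) - expected_loss(down, losses)) / (2 * eps))

# REINFORCE: sample i ~ p and average grad(log p_i) * l_i.
rng = random.Random(0)
num_samples = 200_000
mc_grad = [0.0] * n
for i in rng.choices(range(n), weights=p, k=num_samples):
    for j in range(n):
        score = (1.0 if i == j else 0.0) - p[j]   # d log p_i / d theta_j
        mc_grad[j] += score * losses[i] / num_samples

print(fd_grad)
print(mc_grad)
```

The two gradients agree only up to sampling noise. Each sample touches a single expert's loss, which is the sparse-computation property the derivation is after.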

The problem with REINFORCE is that it has high variance. Intuitively, this is because it puts p_i outside the loss function \ell; if possible, we would prefer to use a “reparameterization” form where p_i is inside \ell. To derive such a form, we exploit the invariance of REINFORCE to subtracting a baseline: \begin{aligned} \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}}\log p_i \cdot \ell(\boldsymbol{e}_i)] =&\, \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}}\log p_i \cdot (\ell(\boldsymbol{e}_i) - \ell(\boldsymbol{0}))] \\[4pt] \approx&\, \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}}\log p_i \cdot \langle\nabla_{\boldsymbol{e}_i} \ell(\boldsymbol{e}_i), \boldsymbol{e}_i - \boldsymbol{0}\rangle] \\[4pt] = &\, \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}} \langle\nabla_{\boldsymbol{e}_i} \ell(\boldsymbol{e}_i), \log p_i \cdot \boldsymbol{e}_i\rangle] \\[4pt] = &\, \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}} \ell((\log p_i + \color{skyblue}{[}1 - \log p_i \color{skyblue}{]_{\text{sg}}}) \cdot\boldsymbol{e}_i)] \\[4pt] = &\, \nabla_{\boldsymbol{\theta}} \mathbb{E}_{i\sim \boldsymbol{p}} [\ell((\log p_i + \color{skyblue}{[}1 - \log p_i \color{skyblue}{]_{\text{sg}}}) \cdot\boldsymbol{e}_i)] \\ \end{aligned} where the approximate equality \approx is a first-order Taylor expansion at \boldsymbol{e}_i, and \color{skyblue}{[}\cdot\color{skyblue}{]_{\text{sg}}} stands for Stop Gradient. In the end, we obtain a Straight-Through Estimator (STE) that uses 1 in the forward pass and \log p_i in the backward pass to provide gradients for the Router.

Final Form

Although STE provides a feasible training scheme, the inconsistency between forward and backward propagation often leads to suboptimal results. At this point there is a remarkably simple improvement: change each Expert to p_i\boldsymbol{e}_i. Repeating the above derivation, we have \begin{aligned} \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}}\log p_i \cdot \ell(p_i\boldsymbol{e}_i)] =&\, \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}}\log p_i \cdot (\ell(p_i\boldsymbol{e}_i) - \ell(\boldsymbol{0}))] \\[4pt] \approx&\, \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}}\log p_i \cdot \langle\nabla_{p_i\boldsymbol{e}_i} \ell(p_i\boldsymbol{e}_i), p_i\boldsymbol{e}_i - \boldsymbol{0}\rangle] \\[4pt] = &\, \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}} \langle\nabla_{p_i \boldsymbol{e}_i} \ell(p_i \boldsymbol{e}_i), p_i \boldsymbol{e}_i\rangle] \\[4pt] = &\, \mathbb{E}_{i\sim \boldsymbol{p}} [\nabla_{\boldsymbol{\theta}} \ell(p_i \boldsymbol{e}_i)] \\[4pt] = &\, \nabla_{\boldsymbol{\theta}} \mathbb{E}_{i\sim \boldsymbol{p}} [\ell(p_i \boldsymbol{e}_i)] \\[4pt] \end{aligned} The third line uses p_i \nabla_{\boldsymbol{\theta}}\log p_i = \nabla_{\boldsymbol{\theta}} p_i, with the loss gradient \nabla_{p_i \boldsymbol{e}_i} \ell(p_i \boldsymbol{e}_i) treated as a constant. This transformation is elegant and worth studying carefully. By changing the Expert from \boldsymbol{e}_i to p_i\boldsymbol{e}_i, we eliminate the Stop Gradient, achieve consistency between forward and backward passes, and in theory raise the ceiling of model performance.

Now, we can answer the question from the beginning:

If we need a top-down probabilistic derivation, then the Router as a Gate should be normalized, but should not use Re-Norm.
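For the k=1 case, the difference between the three gate choices can be seen with a toy finite-difference experiment. This is only a sketch under strong simplifying assumptions: the Expert outputs are fixed vectors, the loss is a squared error against a made-up target, and the Router parameters are the Softmax logits themselves. None of these choices come from the article.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top1_loss(theta, expert_outputs, target, mode):
    """Loss of a Top-1 MoE output under three different gate choices."""
    p = softmax(theta)
    i = max(range(len(p)), key=lambda j: p[j])   # Router: pick the best-scoring Expert
    if mode == "ungated":      # naive model: output e_i
        gate = 1.0
    elif mode == "gated":      # normalized gate: output p_i * e_i
        gate = p[i]
    elif mode == "renorm":     # Re-Norm with k=1: output (p_i / p_i) * e_i
        gate = p[i] / p[i]
    else:
        raise ValueError(mode)
    output = [gate * x for x in expert_outputs[i]]
    return sum((o - t) ** 2 for o, t in zip(output, target))


def router_grad(theta, expert_outputs, target, mode, eps=1e-6):
    """Central finite-difference gradient of the loss w.r.t. the router logits."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        diff = top1_loss(up, expert_outputs, target, mode) - top1_loss(
            down, expert_outputs, target, mode)
        grad.append(diff / (2 * eps))
    return grad


theta = [0.2, -0.4, 1.0, 0.3]   # hypothetical router logits
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]  # hypothetical e_i
target = [0.8, 0.6]             # hypothetical regression target

grad_ungated = router_grad(theta, expert_outputs, target, "ungated")
grad_renorm = router_grad(theta, expert_outputs, target, "renorm")
grad_gated = router_grad(theta, expert_outputs, target, "gated")
print(grad_ungated, grad_renorm, grad_gated)
```

In this toy setting only the `"gated"` variant gives the Router a nonzero gradient. The ungated and Re-Norm variants leave the loss unchanged under small perturbations of the logits.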

To Sample or Not

A noteworthy detail is that \mathbb{E}_{i\sim \boldsymbol{p}} implies we should sample from \boldsymbol{p}, but in practice we usually directly select the Top-k. How should we understand this discrepancy?

This is actually a trade-off between diversity and stability. Random sampling encourages the model to explore more fully, but sampling increases gradient variance and introduces additional instability; directly selecting Top-k is more stable, but there is concern that the model may fall into suboptimal solutions or even collapse. Fortunately, various load balancing strategies are now quite mature and to some extent already encourage comprehensive exploration, so selecting Top-k remains the mainstream.

If one wants to sample while maintaining stability, one can expand slightly beyond Top-k rather than sampling from the full distribution. For example, first select the Top-(k+c), then randomly pick k Experts from those k+c; or add slight noise to the logits of \boldsymbol{p} and then select the Top-k. This adds randomness without straying too far from the original Top-k, and so preserves stability.
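Both restricted-sampling ideas are easy to sketch. The article does not specify how the k Experts are drawn from the k+c candidates or what noise is used, so the uniform draw and the Gaussian noise below are assumptions made for illustration only, and the function names are hypothetical.

```python
import random


def top_k_indices(scores, k):
    """Indices of the k largest scores, largest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def sample_from_expanded_top_k(scores, k, c, rng):
    """Take the Top-(k+c) candidates, then pick k of them.

    Assumption: the k Experts are drawn uniformly without replacement.
    """
    candidates = top_k_indices(scores, k + c)
    return rng.sample(candidates, k)


def noisy_top_k(scores, k, noise_std, rng):
    """Perturb the logits, then take the Top-k of the perturbed values.

    Assumption: the noise is i.i.d. Gaussian with standard deviation noise_std.
    """
    noisy = [s + rng.gauss(0.0, noise_std) for s in scores]
    return top_k_indices(noisy, k)


rng = random.Random(0)
logits = [1.2, -0.3, 0.7, 2.1, 0.0, 1.9]   # hypothetical router logits
expanded_pick = sample_from_expanded_top_k(logits, k=2, c=1, rng=rng)
noisy_pick = noisy_top_k(logits, k=2, noise_std=0.1, rng=rng)
print(expanded_pick, noisy_pick)
```

Setting `c=0` or `noise_std=0.0` recovers plain Top-k selection, so both parameters act as knobs for how much exploration is added on top of the deterministic choice.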

Article Summary

This article attempts to explore, from first principles, the design issues of Router and Gate in MoE, providing a probabilistic interpretation for the normalization of gating.

For reprinting, please include the address of this article: https://kexue.fm/archives/11782