MoE Travelogue: 6. Optimal Allocation Promotes Balance · English (unofficial) translations of posts at kexue.fm

Review of Methods

Among the two existing methods, Aux Loss is relatively simple: its core idea is “penalize wherever there is imbalance” by adding a regularization term that penalizes uneven load. However, Aux Loss has two problems: first, the penalty coefficient is tricky to tune—if too large it interferes with the main loss optimization, if too small the balancing effect is poor; second, Aux Loss relies on the Straight-Through Estimator (STE), which means its gradient is suboptimal and may introduce unknown effects beyond load balancing.

To address this, DeepSeek proposed the second approach, Loss-Free, which introduces an additional bias term to assist ranking, as shown below: \boldsymbol{y} = \sum_{i\in \mathop{\mathrm{argtop}}_k \boldsymbol{\rho}} \rho_i \boldsymbol{e}_i \qquad\to\qquad \boldsymbol{y} = \sum_{i\in \mathop{\mathrm{argtop}}_k (\boldsymbol{\rho} + \boldsymbol{b})} \rho_i \boldsymbol{e}_i Note that \boldsymbol{b} is only used to adjust the ranking of experts; the weight multiplied onto the expert is still \rho_i, so it does not directly participate in the model’s computation and does not introduce interfering gradients. However, the lack of gradient for \boldsymbol{b} means we need to design a separate update rule. The idea is straightforward: the larger b_i is, the higher the chance that expert i is selected. So we first compute the current load distribution \boldsymbol{F}; if F_i exceeds the desired value 1/n, we decrease b_i, otherwise increase b_i: \boldsymbol{b} \leftarrow \boldsymbol{b} - \gamma \operatorname{sign}(\boldsymbol{F} - 1/n)

Overall, Loss-Free is less intrusive to the model and indeed simpler and more elegant, but it is not perfect. Although it does not have a penalty coefficient like Aux Loss, it still has a \gamma parameter to tune, which acts as the learning rate for \boldsymbol{b}; the original paper recommends \gamma=10^{-3}. This is tightly coupled with the sigmoid activation used for \boldsymbol{\rho}; if the activation function is changed, \gamma needs to be retuned.

Furthermore, even with sigmoid activation, some layers may have a very skewed \boldsymbol{\rho} distribution, making the model sensitive to \gamma; a constant \gamma struggles to achieve balance. This is not uncommon: for example, if the first few layers use MoE, they often have difficulty balancing, which is why the ‘first_k_dense’ trick exists; also when the model is large or the total number of experts n is large, individual layers can easily become hard to balance.

Linear Programming

Let us formulate the problem more precisely: Suppose there are m tokens, and the router for the i-th token gives scores \boldsymbol{s}_i = (s_{i,1},s_{i,2},\dots,s_{i,n}) for the n experts. So there are mn scores in total; they may be positive or negative and need not lie within a predefined range. We want to use these scores to devise an assignment plan that decides which experts each token should activate.

We stipulate that each token selects k experts. A natural baseline is to pick the k experts with the highest scores, but this can cause load imbalance—some experts may be activated significantly more or less often. Therefore we additionally require that each expert be activated exactly mk/n times. Under these two constraints, we seek the assignment that maximizes the total score, formalized as \max_{x_{i,j}\in\{0,1\}} \sum_{i,j} x_{i,j}s_{i,j} \qquad \text{s.t.} \qquad \sum_j x_{i,j} = k,\quad \sum_i x_{i,j} = \frac{mk}{n} \label{eq:target} Here x_{i,j}=1 means token i selects expert j, and x_{i,j}=0 means it does not. Note that mk/n must be an integer for the equality constraints to hold exactly; we assume this for now. Here each expert is activated exactly mk/n times, which is the ideal perfectly uniform state. In practice the constraints can be relaxed, but for theoretical derivation we adopt the strictest ones.

In the above, x_{i,j} can only be 0 or 1, which is an integer programming problem. Discrete optimization is generally difficult, so we consider its relaxation: \max_{x_{i,j}\in[0,1]} \sum_{i,j} x_{i,j}s_{i,j} \qquad \text{s.t.} \qquad \sum_j x_{i,j} = k,\quad \sum_i x_{i,j} = \frac{mk}{n} \label{eq:relax} Now x_{i,j} can be any real number in [0,1], all else unchanged. Note that x_{i,j} appears linearly in both the objective and constraints, so this is a linear programming problem over a bounded region.

Max-Min

Constrained optimization is also nontrivial, so we consider the max-min form that removes the constraints (i.e., the method of Lagrange multipliers): \max_{x_{i,j}\in[0,1]}\min_{\alpha_i,\beta_j} \sum_{i,j} x_{i,j}s_{i,j} - \sum_i \alpha_i\left(\sum_j x_{i,j} - k\right) - \sum_j \beta_j\left(\sum_i x_{i,j} - \frac{mk}{n}\right) \label{eq:relax-max-min} If \sum_j x_{i,j} = k and \sum_i x_{i,j} = mk/n, then the above is equivalent to \eqref{eq:relax} and the maximum is finite; but if either constraint is violated, the inner minimization can yield -\infty, so the maximum would be -\infty. Clearly -\infty is worse than any finite value, so the optimal must satisfy the constraints. Hence it is equivalent to \eqref{eq:relax}.

The objective \eqref{eq:relax-max-min} is linear in x_{i,j},\alpha_i,\beta_j; linear functions are both convex and concave, and [0,1] is a convex set, so it satisfies the conditions of the Minimax theorem. We may swap the order of max and min, yielding: \min_{\alpha_i,\beta_j} \max_{x_{i,j}\in[0,1]} \sum_{i,j} x_{i,j}(s_{i,j} - \alpha_i - \beta_j) + k\sum_i \alpha_i + \frac{mk}{n}\sum_j \beta_j \label{eq:relax-min-max} Here we have separated the terms involving x_{i,j},\alpha_i,\beta_j. A closer look reveals that the inner max can be solved explicitly: when s_{i,j} - \alpha_i - \beta_j > 0, we set x_{i,j}=1 to maximize; when it is negative, we set x_{i,j}=0; when it equals zero, any x_{i,j}\in[0,1] yields the same value. That is: \begin{cases} x_{i,j}^* = 1, & s_{i,j} - \alpha_i - \beta_j > 0 \\ x_{i,j}^* = 0, & s_{i,j} - \alpha_i - \beta_j < 0 \\ x_{i,j}^* \in [0,1], & s_{i,j} - \alpha_i - \beta_j = 0 \end{cases} The case s_{i,j} - \alpha_i - \beta_j = 0 is special; assuming it occurs with negligible probability, x_{i,j}^* is essentially 0 or 1. Even if it does occur, by the arbitrariness of x_{i,j}^* we can also set it to 0 or 1 to satisfy the constraints. Thus, although the relaxed problem \eqref{eq:relax} is a relaxation, its optimal solution is also optimal for the original integer problem \eqref{eq:target}; they are completely equivalent.

Divide and Conquer

Substituting the optimal x_{i,j}^* into \eqref{eq:relax-min-max} gives x_{i,j}^*(s_{i,j} - \alpha_i - \beta_j) = \max(0, s_{i,j} - \alpha_i - \beta_j), so the objective simplifies to \min_{\alpha_i,\beta_j} \sum_{i,j} \max(0, s_{i,j} - \alpha_i - \beta_j) + k\sum_i \alpha_i + \frac{mk}{n}\sum_j \beta_j We will solve it using alternating minimization: first fix \beta_j and solve for \alpha_i, then fix \alpha_i and solve for \beta_j, iterating. Since \alpha_i and \beta_j are symmetric, these two steps are essentially the same problem. Consider fixing \beta_j and optimizing \alpha_i: \min_{\alpha_i} \sum_{i,j} \max(0, s_{i,j} - \alpha_i - \beta_j) + k\sum_i \alpha_i We can also observe that each \alpha_i appears independently, so we can decompose this into m independent subproblems. This means we can temporarily drop the index i and further simplify to: \min_{\alpha} k\alpha + \sum_j \max(0, s_j - \beta_j - \alpha) Sort the values s_j - \beta_j in descending order: s_{\sigma_1} - \beta_{\sigma_1} \geq s_{\sigma_2} - \beta_{\sigma_2} \geq \cdots \geq s_{\sigma_n} - \beta_{\sigma_n}, where s_{\sigma_j} - \beta_{\sigma_j} is the j-th largest. If we suppose that s_{\sigma_l} - \beta_{\sigma_l} \geq \alpha \geq s_{\sigma_{l+1}} - \beta_{\sigma_{l+1}}, then the objective becomes k\alpha + \sum_{j=1}^l (s_{\sigma_j} - \beta_{\sigma_j} - \alpha) = \begin{cases} \displaystyle \sum_{j=1}^k (s_{\sigma_j} - \beta_{\sigma_j}) + \sum_{j=k+1}^l \underbrace{(s_{\sigma_j} - \beta_{\sigma_j} - \alpha)}_{\geq 0}, & l \geq k \\[12pt] \displaystyle \sum_{j=1}^k (s_{\sigma_j} - \beta_{\sigma_j}) - \sum_{j=l+1}^k \underbrace{(s_{\sigma_j} - \beta_{\sigma_j} - \alpha)}_{\leq 0}, & l \leq k \end{cases} This shows that whether l > k or l < k, the objective is increased; therefore the minimum can only be achieved at l=k. In that case the result no longer depends on the exact value of \alpha, i.e., \alpha^* can be any number between the k-th and (k+1)-th largest values of s_j - \beta_j. By convention, we take \alpha^* as the (k+1)-th largest.

Alternating Iteration

Restoring the index i, we have: for each i, \alpha_i^* is the (k+1)-th largest value among s_{i,j} - \beta_j (over j). Similarly, fixing \alpha_i and solving for \beta_j yields: for each j, \beta_j^* is the (mk/n+1)-th largest value among s_{i,j} - \alpha_i (over i). We alternate these two steps as the final algorithm.

Once we have obtained sufficiently accurate \boldsymbol{\alpha}^* and \boldsymbol{\beta}^*, the earlier analysis implies that x_{i,j}^* automatically becomes 0 or 1 and satisfies the constraints, with x_{i,j}^*=1 corresponding to s_{i,j} - \alpha_i^* - \beta_j^* > 0. Combining with the constraint \sum_j x_{i,j}^* = k, we conclude that for each token i, the selected experts are precisely the top-k of \boldsymbol{s}_i - \boldsymbol{\beta}^*.

This tells us that during inference only \boldsymbol{\beta}^* is needed; \boldsymbol{\alpha}^* is merely an intermediate variable in the solution process and can be discarded after training. This is crucial because \boldsymbol{\beta} has a fixed size n, while \boldsymbol{\alpha} has size m, the global batch size, which varies dynamically and is not a suitable format for inference. The solution procedure that exploits this property is illustrated below, where \mathop{\mathrm{desc\_sort}} denotes descending sort.

Input: score matrix $\boldsymbol{s}\in\mathbb{R}^{m\times n}$
Output: assignment $\boldsymbol{x}\in\{0,1\}^{m\times n}$

1: Initialize $\boldsymbol{\beta} = \mathbf{0}_{1\times n}$
2: For $t=1,2,\dots,T$ do
3:     $\boldsymbol{\alpha} \leftarrow \descsort(\boldsymbol{s} - \boldsymbol{\beta}, \text{axis=1})_{[:, k:k+1]}$
4:     $\boldsymbol{\beta} \leftarrow \descsort(\boldsymbol{s} - \boldsymbol{\alpha}, \text{axis=0})_{[mk/n:mk/n+1]}$
5: Output $x_{i,j}=1$ if $j\in\argtop_k(\boldsymbol{s}_i - \boldsymbol{\beta})$ else $0$

An additional refinement is that we can use the concept of “quantile” to unify the operations: the (k+1)-th largest among n numbers and the (mk/n+1)-th largest among m numbers are both the “1-k/n quantile” along their respective dimensions. Numerical frameworks such as NumPy, JAX, and PyTorch provide a quantile function, which avoids a full sort and saves some complexity.

For this reason we call this algorithm Quantile Balancing (QB).

Watch Out for Pitfalls

But it is not time to celebrate yet; there is a subtle trap that can lead to fundamentally wrong results if overlooked.

As mentioned, QB only needs \boldsymbol{\beta} for inference, and storing only \boldsymbol{\beta} during training is sufficient. So from a Loss-Free perspective, QB provides a new bias update method—one could even call it “Quantile Bias”. Whether the original SignSGD or our Quantile, both rely on the scores of all tokens, so the order must not be mistaken: one must use the old \boldsymbol{\beta} to select experts for the current batch, and only then update \boldsymbol{\beta}, otherwise there is information leakage.

Some readers may wonder: what information can a bias vector that does not directly participate in the forward computation leak? Indeed, intuitively the amount of leaked information seems minimal, but the risk still exists. When training small models we might try the leaky version, but for large models we should not take this risk, because a large model is powerful enough to amplify any subtle bug. Therefore, maintaining consistency between training and inference and eliminating any risk of information leakage is a basic discipline for training large models.

In practical training, QB can be further adjusted: instead of starting from zero and iterating T times, we can start from the previous step’s bias and perform only one iteration per step. This avoids overfitting to the current batch and reduces computation. The top-k selection based on \boldsymbol{s} - \boldsymbol{\beta} is already required in the original MoE; now it only changes to top-(k+1), which adds almost no cost. The extra step is finding the 1-k/n quantile along \text{axis=0} using \boldsymbol{s} - \boldsymbol{\alpha}.

Input: score matrix $\boldsymbol{s}\in\mathbb{R}^{m\times n}$, previous $\boldsymbol{\beta}\in\mathbb{R}^n$
Output: assignment $\boldsymbol{x}\in\{0,1\}^{m\times n}$, new $\boldsymbol{\beta}\in\mathbb{R}^n$

1: $x_{i,j}=1$ if $j\in\argtop_k(\boldsymbol{s}_i - \boldsymbol{\beta})$ else $0$
2: $\boldsymbol{\alpha} \leftarrow \descsort(\boldsymbol{s} - \boldsymbol{\beta}, \text{axis=1})_{[:, k:k+1]}$
3: $\boldsymbol{\beta} \leftarrow \descsort(\boldsymbol{s} - \boldsymbol{\alpha}, \text{axis=0})_{[mk/n:mk/n+1]}$
4: Output $\boldsymbol{x},\boldsymbol{\beta}$

However, even with just one iteration, the extra step is still relatively expensive: it requires finding the (mk/n+1)-th largest among m elements, where m equals (global sample count \times sequence length), usually on the order of millions. Under various parallelism strategies and gradient accumulation, an exact implementation is often unacceptable. A compromise is to split the samples into the largest acceptable mini-batches, compute a \boldsymbol{\beta} for each mini-batch using the formula, and then average them as the final result.

Once these issues are addressed, QB is almost all advantages: first, it has no hyperparameters like a learning rate to tune; second, it balances extremely fast and is especially good at handling extreme cases—for example, when training a fully MoE model, even the first MoE layer becomes extremely balanced. However, for layers that can already be balanced by the original SignSGD rule, QB usually does not offer additional advantages.

Comparison of MaxVio for the first MoE layer.

Demo Code

Below is a demo code snippet for interested readers to experiment with:

import numpy as np

def quantile_bias(s, k, T=5):
    """Alternating quantile to find optimal bias
    Principle: https://kexue.fm/archives/11619
    """
    m, n = s.shape
    beta = np.zeros((1, n))
    for _ in range(T):
        alpha = np.quantile(s - beta, 1 - k / n, axis=1, keepdims=True)
        # alpha = alpha.clip(0, np.inf)  # BIP adds this step
        beta = np.quantile(s - alpha, 1 - k / n, axis=0, keepdims=True)
        # beta = beta.clip(0, np.inf)  # BIP adds this step
    return beta

def max_min_avg_vio(s, k):
    """Compute max_vio, min_vio, and avg_vio
    where max_vio >= 0, avg_vio >= 0, -1 <= min_vio <= 0;
    the closer to 0, the more balanced.
    """
    m, n = s.shape
    topk = np.argsort(-s, axis=1)[:, :k]
    f = np.bincount(topk.reshape(-1), minlength=n)
    f = f / f.sum() * n - 1
    return f.max(), f.min(), np.abs(f).mean()

m, n, k = 100000, 256, 8
s = np.random.rand(m, n) + np.random.rand(n)  # simulate an uneven scoring

b = quantile_bias(s, k, 5)
max_min_avg_vio(s, k)      # max_vio, min_vio, avg_vio of direct top-k
max_min_avg_vio(s - b, k)  # max_vio, min_vio, avg_vio after subtracting bias

Related Work

The perspective of viewing MoE load balancing as an optimal assignment problem first appeared in the paper “BASE Layers: Simplifying Training of Large, Sparse Models”, but a complete general solution was proposed in “Binary-Integer-Programming Based Algorithm for Expert Load Balancing in Mixture-of-Experts Models” (BIP for short). QB is improved from BIP.

Compared to QB’s objective \eqref{eq:target}, BIP changes the equality constraints to inequality constraints: \max_{x_{i,j}\in\{0,1\}} \sum_{i,j} x_{i,j}s_{i,j} \qquad \text{s.t.} \qquad \sum_j x_{i,j} \leq k,\quad \sum_i x_{i,j} \leq \frac{mk}{n} \label{eq:target-leq} Then repeating the same derivation steps, at the max-min stage there is an additional constraint \alpha_i\geq 0, \beta_j\geq 0, i.e., we must consider \max_{x_{i,j}\in[0,1]}\min_{\alpha_i\geq 0,\beta_j\geq 0} \sum_{i,j} x_{i,j}s_{i,j} - \sum_i \alpha_i\left(\sum_j x_{i,j} - k\right) - \sum_j \beta_j\left(\sum_i x_{i,j} - \frac{mk}{n}\right) so that it is equivalent to the relaxed version of \eqref{eq:target-leq}. Then max and min can still be swapped, and following a similar derivation, the update rules for \boldsymbol{\alpha},\boldsymbol{\beta} acquire an additional \max(0,\cdot) clipping to enforce non-negativity: \begin{aligned} \boldsymbol{\alpha} &\leftarrow \max(0, \mathop{\mathrm{desc\_sort}}(\boldsymbol{s} - \boldsymbol{\beta}, \text{axis=1})_{[:, k:k+1]}) \\ \boldsymbol{\beta} &\leftarrow \max(0, \mathop{\mathrm{desc\_sort}}(\boldsymbol{s} - \boldsymbol{\alpha}, \text{axis=0})_{[mk/n:mk/n+1]}) \end{aligned} However, the author’s tests found that this clipping significantly slows down load balancing and often results in cases where only the “rich” are penalized but the “poor” cannot be helped (high-frequency experts are suppressed, but low-frequency experts cannot be rescued). This indicates that the clipping is actually counterproductive. Therefore, QB directly adopts equality constraints, which removes the non-negativity constraint on \boldsymbol{\alpha},\boldsymbol{\beta} and makes the solution maximally simple.

Moreover, BIP seems to have made the mistake mentioned in the previous section: it updates \boldsymbol{\beta} first and then selects top-k, which violates causality (leaking future information) and the principle of training-inference consistency (across samples). Nevertheless, BIP’s thorough analysis of load balancing from the optimal assignment perspective remains highly inspiring.

After searching, the author found that some subsequent works have explored along this direction. They overlap with this article to some extent but are not entirely identical; we will not discuss them one by one here:

Maximum Score Routing For Mixture-of-Experts

Selective Sinkhorn Routing for Improved Sparse Mixture of Experts

MicroMoE: Fine-Grained Load Balancing for Mixture-of-Experts with Token Scheduling

A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models

Gradient Descent

Finally, we introduce a solution that lies between Loss-Free and QB. We know that in the QB iteration, computing \boldsymbol{\alpha} is relatively cheap, while the expensive part is computing \boldsymbol{\beta}, which requires some sort over all tokens. Given \boldsymbol{\alpha}, the optimization objective for \boldsymbol{\beta} is \min_{\beta_j} \underbrace{\sum_{i,j} \max(0, s_{i,j} - \alpha_i - \beta_j) + \frac{mk}{n}\sum_j \beta_j}_{\text{denoted } \ell} Besides using quantile to find its optimal solution, is there a cheaper way to obtain an approximate solution? Indeed there is! Clearly the objective \ell is differentiable, so we can use gradient descent. Its gradient is \frac{\partial\ell}{\partial\beta_j} = \frac{mk}{n} - \sum_{i=1}^m \chi(s_{i,j} - \alpha_i - \beta_j > 0) where \chi is the indicator function, \chi(\text{True})=1,\chi(\text{False})=0. Obviously computing this gradient is also relatively cheap. With the gradient, we can perform gradient descent; to align with Loss-Free, we consider SignSGD: \beta_j \leftarrow \beta_j - \gamma \operatorname{sign}\left(\frac{\partial\ell}{\partial\beta_j}\right) Using this to replace the quantile update of \boldsymbol{\beta} in QB yields a cost similar to Loss-Free; empirical results lie between the two (with sigmoid activation and the same \gamma). In fact, we can prove that if, for each row of the score matrix \boldsymbol{s} - \boldsymbol{\beta}, the k-th and (k+1)-th largest elements are unequal, then this scheme is strictly equivalent to Loss-Free.

Conclusion

In this article, we explored the MoE load balancing problem from the perspective of optimal allocation and derived a new auxiliary-loss-free load balancing algorithm, Quantile Balancing, which is more stable and accurate than existing Loss-Free schemes, works for router scores with arbitrary ranges, and has no extra hyperparameters to tune.

For reposting, please include the link to this article: https://kexue.fm/archives/11619

For more details on reposting, please refer to: Science Space FAQ