In the previous article A Tour of MoE: 6. Optimizing Assignment for Load Balancing, we achieved load balancing by solving the following optimal assignment problem: \max_{x_{i,j}\in\{0,1\}} \sum_{i,j} x_{i,j}s_{i,j} \qquad\text{s.t.}\qquad \sum_j x_{i,j} = k,\quad \sum_i x_{i,j} = \frac{mk}{n} where \sum_j x_{i,j} = k means each token activates exactly k experts, and \sum_i x_{i,j} = mk/n means each expert is activated exactly mk/n times. However, upon closer thought, one realizes that the former is actually unnecessary for both training and inference. What we truly need is the latter, which implies that on average each token activates k experts and each expert has a balanced load. This is sufficient to achieve the goals of MoE. Therefore, this article considers a simplified problem: \max_{x_{i,j}\in\{0,1\}} \sum_{i,j} x_{i,j}s_{i,j} \qquad\text{s.t.}\qquad \sum_i x_{i,j} = \frac{mk}{n}\label{eq:target-dyn}
Dynamic Activation
Assuming the reader is familiar with the previous article, the derivations here will be somewhat abbreviated. If any step is unclear, please refer back to the previous article for review. As before, we first consider a relaxed version of [eq:target-dyn]: \max_{x_{i,j}\in[0,1]} \sum_{i,j} x_{i,j}s_{i,j} \qquad\text{s.t.}\qquad \sum_i x_{i,j} = \frac{mk}{n} Then consider its equivalent max-min form: \max_{x_{i,j}\in[0,1]}\min_{\beta_j} \sum_{i,j} x_{i,j}s_{i,j} - \sum_j \beta_j\left(\sum_i x_{i,j} - \frac{mk}{n}\right) Here the order of \max and \min is interchangeable. Swapping them and rearranging gives \min_{\beta_j}\max_{x_{i,j}\in[0,1]} \sum_{i,j} x_{i,j}(s_{i,j} - \beta_j) + \frac{mk}{n} \sum_j \beta_j\label{eq:relax-min-max} The \max step can be solved first, yielding \left\{\begin{aligned}&\,x_{i,j}^* = 1, &\, s_{i,j} - \beta_j > 0 \\ &\,x_{i,j}^* = 0, &\, s_{i,j} - \beta_j < 0 \\ &\,x_{i,j}^* \in [0,1], &\, s_{i,j} - \beta_j = 0 \end{aligned}\right. This tells us that any expert with s_{i,j} - \beta_j > 0 is activated; the activation count is indeterminate. This formally coincides with A Tour of MoE: 4. Invest More Where It’s Difficult, except that we previously designed it intuitively, whereas here it is derived from the more fundamental objective [eq:target-dyn].
One-Step Solution
Substituting x_{i,j}^* back into [eq:relax-min-max] gives x_{i,j}^*(s_{i,j} - \beta_j) = \max(0, s_{i,j} - \beta_j), so the objective for \beta_j simplifies to \min_{\beta_j} \sum_{i,j} \max(0, s_{i,j} - \beta_j) + \frac{mk}{n} \sum_j \beta_j\label{eq:beta-obj} Each term involving \beta_j is independent, so it can be decomposed into m separate sub-optimization problems. This means we can temporarily drop the subscript j: \min_{\beta} \frac{mk}{n}\beta + \sum_i \max(0, s_i - \beta) Sort all s_i in descending order as s_{\sigma_1}\geq s_{\sigma_2} \geq \cdots \geq s_{\sigma_m}, and suppose we know s_{\sigma_l}\geq\beta\geq s_{\sigma_{l+1}}. Then we can remove the \max to obtain \frac{mk}{n}\beta + \sum_{i=1}^l (s_{\sigma_i} - \beta) = \left\{\begin{aligned} &\,\sum_{i=1}^{mk/n} s_{\sigma_i} + \sum_{i=mk/n+1}^l \underbrace{(s_{\sigma_i} - \beta)}_{\geq 0},&\, l \geq mk/n \\ &\,\sum_{i=1}^{mk/n} s_{\sigma_i} - \sum_{\hphantom{ab}i=l+1\hphantom{ab}}^{mk/n} \underbrace{(s_{\sigma_i} - \beta)}_{\leq 0},&\, l \leq mk/n \\ \end{aligned}\right. This shows that both l > mk/n and l < mk/n increase the objective, so the minimum is achieved at l=mk/n. That is, \beta^* lies between the mk/n-th and (mk/n+1)-th largest elements of s_i. By convention, we take the (mk/n+1)-th largest element. Reinstating the subscript j, for a given j, \beta_j^* is the (mk/n+1)-th element when sorting s_{i,j} in descending order, or equivalently, the 1-k/n quantile.
Note that now the only variable to solve for is \boldsymbol{\beta}, so there is no need for alternating iterations. A single step of quantile computation yields the perfectly balanced optimal solution! We still call this Quantile Balancing (QB).
Practical Considerations
If the reader is already familiar with A Tour of MoE: 6. Optimizing Assignment for Load Balancing, they will find that this article is essentially a simplified version. However, simplicity does not mean triviality; it is actually more elegant and efficient than Top-k MoE. First, it activates experts by checking whether \boldsymbol{s} - \boldsymbol{\beta} > 0, which is more efficient than selecting the Top-k. Second, it obtains the optimal load-balancing solution with a single quantile step, which is also more efficient and elegant.
Of course, the one-step optimal solution may overfit the current training batch. To mitigate this, the considered improvement is to apply an EMA (Exponential Moving Average) between the current batch’s optimal solution and the historical solution (indeed, the Top-k version from the previous article can also consider EMA). Note that we must also be careful about the trap of information leakage, so we should first use the old \boldsymbol{\beta} to activate the experts, and only then update \boldsymbol{\beta}.
The algorithm flow is as follows (in the author’s experiments, \lambda=0.9): \begin{array}{|l|} \hline \text{Quantile Balancing (QB) --- Dynamic Activation Version} \\[4pt] \hline \text{Input: score matrix } \boldsymbol{s}\in\mathbb{R}^{m\times n}, \text{ previous } \boldsymbol{\beta}\in\mathbb{R}^n, \text{ decay rate } \lambda \\ \text{Output: assignment } \boldsymbol{x}\in\{0,1\}^{m\times n}, \text{ new } \boldsymbol{\beta}\in\mathbb{R}^n \\[4pt] \hline \begin{array}{ll} 1: & x_{i,j}=1 \text{ if } s_{i,j} - \beta_j > 0 \text{ else } 0 \\ 2: & \boldsymbol{\beta} \leftarrow \lambda\boldsymbol{\beta} + (1-\lambda)\mathop{\text{desc\_sort}}(\boldsymbol{s}, \text{axis=0})_{[mk/n:mk/n+1]} \\ 3: & \text{Output } \boldsymbol{x},\boldsymbol{\beta} \end{array} \\ \hline \end{array}
In practice, we also face the problem that computing the 1-k/n quantile of \boldsymbol{s} is relatively expensive. It requires finding the (mk/n+1)-th largest element among m elements, where m equals the total number of samples times the sequence length. Due to various parallelism strategies and gradient accumulation constraints, an exact implementation is often unacceptable. Thus, we again consider a compromise: split the samples into the largest mini-batches we can afford, compute a separate \boldsymbol{\beta} for each mini-batch, and average them to obtain the final result.
Expert Choice
In fact, this dynamic version of QB has a very simple interpretation, namely “Expert Choice”:
Since you want each expert to be activated exactly mk/n times, why not let the experts choose the tokens? That is, each expert selects the Top-mk/n tokens. Isn’t this just finding the mk/n largest elements along \text{axis=0}?
However, naive Expert Choice has a fatal flaw: it requires comparing all tokens globally, leading to cross-sample and cross-sequence selection. This violates causality (seeing future information during training) and breaks training-inference consistency (inference results depend on batch size). That is precisely why this method has been difficult to implement in practice.
The ingenuity of this article lies in transforming it into a bias form: using the (mk/n+1)-th largest element (equivalently, the 1-k/n quantile) of each expert as the threshold \boldsymbol{\beta}, we reformulate Expert Choice as Token Choice—from “experts choose Top-mk/n” to “activate if s_{i,j} - \beta_j > 0”. More crucially, we delay the update of \boldsymbol{\beta} until after the activation decision, perfectly avoiding the information leakage problem.
Surprisingly, this bias form that fixes the flaws of Expert Choice is naturally embedded in the dual problem of the original linear program—one has to admit, it carries a pleasing aesthetic appeal.
Initialization Strategy
To prevent information leakage, during training we need to first use the old \boldsymbol{\beta} to decide which experts each token activates, and only then update \boldsymbol{\beta}. This makes the first few training steps sensitive to the initialization of \boldsymbol{\beta}. For example, if \boldsymbol{s} is entirely positive but \boldsymbol{\beta} is initialized to zero, then in the first training step each token will activate all experts, causing a computational explosion.
In A Tour of MoE: 4. Invest More Where It’s Difficult, we already provided a simulation script to estimate a suitable initialization. But here, since we already know the optimal \boldsymbol{\beta} is the 1-k/n quantile, we can make reasonable assumptions about the initial scores and derive an analytical initialization. Specifically, assuming the router’s initial logits follow \mathcal{N}(0,\sigma^2), its 1-k/n quantile is \sigma \cdot \Phi^{-1}\left(1 - \frac{k}{n}\right) This is precisely the optimal \boldsymbol{\beta} for the logits of this distribution, and we use it as the initialization for \boldsymbol{\beta}, where \Phi^{-1} is the quantile function of the standard normal distribution, also known as the inverse cumulative distribution function.
Additionally, the router may apply an activation function. If the activation function is element-wise and strictly increasing, such as Sigmoid, it does not alter the ordering, so we simply apply the corresponding activation to the above expression. For Softmax, it is slightly more complicated: it exponentiates and then normalizes. Assuming the normalization denominator is approximately constant, it becomes similar to the previous case; we only need to exponentiate the above expression and divide by the denominator estimated via quantile-based sampling: \frac{\exp\left[\sigma \cdot \Phi^{-1}\left(1 - \frac{k}{n}\right)\right]}{\sum\limits_{i=1}^n \exp\left[\sigma \cdot \Phi^{-1}\left(1 - \frac{i}{n+1}\right)\right]}
As for estimating \sigma, it is also straightforward. Assuming the router’s input dimension is d and RMS Norm is applied, and assuming the router’s weight matrix is initialized with \mathcal{N}(0,\tilde{\sigma}^2), then the router logits approximately follow a normal distribution with mean near 0 and variance about \tilde{\sigma}^2 d, i.e., \sigma\approx \tilde{\sigma} \sqrt{d}. These are classical results on initialization.[ref]
Demonstration Code
In this section, we again provide a demonstration code snippet for you to play with:
import numpy as np
def quantile_bias(s, k):
"""Dynamic QB: one-step quantile yields optimal bias.
Principle: https://kexue.fm/archives/11626
"""
m, n = s.shape
beta = np.quantile(s, 1 - k / n, axis=0)
return beta
def max_min_avg_vio(s):
"""Compute max_vio, min_vio, and avg_vio.
max_vio >= 0, avg_vio >= 0, -1 <= min_vio <= 0;
the closer to 0, the more balanced.
"""
m, n = s.shape
f = (s > 0).mean(0)
f = f / f.sum() * n - 1
return f.max(), f.min(), np.abs(f).mean()
def avg_std_active(s):
"""Compute the mean and standard deviation of activation counts for each expert.
avg closer to k is better, std closer to 0 is better.
"""
m, n = s.shape
f = (s > 0).mean(0) * n
return f.mean(), f.std()
m, n, k = 100000, 256, 8
s = np.random.rand(m, n) + np.random.rand(n) # simulate an uneven scoring
b = quantile_bias(s, k)
max_min_avg_vio(s - b) # should be all zeros if no mishap
avg_std_active(s - b) # should be (k, 0) if no mishap
from scipy.stats import norm
sigma = 1
s = np.random.randn(m, n) * sigma # simulate initialization
avg_std_active(s) # approximately (128, 0.4), far exceeding budget k
b0 = np.zeros(n) + norm.ppf(1 - k / n) * sigma # norm.ppf is the quantile function of standard normal
avg_std_active(s - b0) # approximately (8, 0.1), close to budget k
s2 = 1 / (1 + np.exp(-s)) # Sigmoid activation
b0_2 = 1 / (1 + np.exp(-b0)) # Sigmoid activation
avg_std_active(s2 - b0_2) # still approximately (8, 0.1), close to budget k
s3 = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True) # Softmax activation
q = (np.arange(n) + 1) / (n + 1) # evenly spaced points (quantiles)
b0_3 = np.exp(b0) / np.exp(sigma * norm.ppf(q)).sum() # Exp activation divided by simulated denominator
avg_std_active(s3 - b0_3) # approximately (7.5, 0.1), close to budget k
Gradient Descent
Finally, if one wants to avoid quantile computation entirely, we can still use gradient descent. We first denote the loss function of objective [eq:beta-obj] as \ell: \min_{\beta_j} \underbrace{\sum_{i,j} \max(0, s_{i,j} - \beta_j) + \frac{mk}{n} \sum_j \beta_j}_{=:\ell} It is differentiable, and its gradient is \frac{\partial\ell}{\partial\beta_j} = \frac{mk}{n} - \sum_{i=1}^m \chi(s_{i,j} - \beta_j > 0) where \chi is the indicator function, \chi(\text{True})=1,\chi(\text{False})=0. Computing this gradient is cheaper than a global quantile, making it suitable for readers sensitive to computational cost. Once we have the gradient, we can perform gradient descent. We still consider SignSGD, aligning with the Loss-Free approach: \beta_j \leftarrow \beta_j - \gamma\mathop{\text{sign}}\left(\frac{\partial\ell}{\partial\beta_j}\right) This can replace the quantile update step for \boldsymbol{\beta}, and because it is inherently a long-term iterative algorithm, EMA can also be omitted. In fact, this is exactly formula (9) proposed in A Tour of MoE: 4. Invest More Where It’s Difficult, which we have re-derived from the perspective of “optimal assignment + dual objective + gradient descent”.
Conclusion
This article continues the exploration of Quantile Balancing (QB) from the previous article. By removing the constraint that “each token can only activate k experts” from the optimal assignment problem, we significantly simplifies the solution. Load balancing can be achieved with just a single quantile step. Each token only needs to check the sign of the biased scores to decide which experts to activate, eliminating the cost of Top-k sorting.
For reprinting, please include the address of this article: https://kexue.fm/archives/11626
For more details on reprinting, please refer to: Science Space FAQ.