English (unofficial) translations of posts at kexue.fm
Source

MoE Journey: 8. Enforcing Sequence-Level Balance

Translated by DeepSeek V4 Pro. Translations can be inaccurate, please refer to the original post for important stuff.

So far, the “MoE Journey” series has published 7 articles, 5 of which revolve around MoE routing and load balancing. From the perspective of routing forms, they can be divided into static computation and dynamic computation; from the perspective of load balancing methods, they can be divided into Aux-Loss and Loss-Free categories, where Loss-Free can be further divided into the SignSGD scheme proposed by DeepSeek and the QB scheme proposed by the author.

Another detail is that the Loss-Free schemes we discussed previously all achieve balance at the global batch level, while in many cases we add a sequence-level Aux Loss to avoid extreme imbalance at the sequence level. If we believe that every Aux-Loss scheme has a corresponding Loss-Free version, then how should we implement sequence-level Loss-Free balancing? This is the topic to be explored in this article.

Review of Previous Work

First, let us review the previous discussions on load balancing. In the article MoE Journey: 2. Not Worrying About Scarcity but Inequality, we first discussed the load balancing problem of MoE, and the proposed solution was Aux Loss, i.e., adding an extra loss function to promote balance. The problem with Aux Loss is that there is an additional weight coefficient to tune, but the advantage is that the granularity of balance can be adjusted arbitrarily, enabling sequence-level, group-level, or global-level balance (of course, the more global, the greater the communication cost).

Starting from MoE Journey: 3. A Different Way to Allocate, we discussed Loss-Free load balancing strategies, which introduce an additional bias term to adjust the degree of balance and design a SignSGD-like rule to update the bias term manually. This approach is more flexible and infrastructure-friendly than Aux Loss. In MoE Journey: 4. More Investment in Difficult Parts, we extended it to MoE with dynamic activation counts.

In MoE Journey: 6. Optimal Allocation Promotes Balance, we viewed the load balancing problem from the perspective of optimal allocation. By alternately solving the dual objective, we obtained a new load balancing strategy called Quantile Balancing (QB). From the result alone, it can be seen as a new and more accurate solution for the bias term in Loss-Free schemes. Then, in MoE Journey: 7. A Minimalist Solution for Dynamic Activation, we similarly extended QB to dynamic MoE.

Here is a detail: the bias term introduced in Loss-Free is globally shared, so all current Loss-Free schemes can only achieve global load balancing. As mentioned earlier, if needed, Aux Loss can also be used to promote sequence-level balance. So, is there a corresponding Loss-Free scheme for sequence-level balance?

Local Centering

Coincidentally, @ChangJonathanC also explored this in the article Causal Routing Bias for Aux-Loss-Free MoE Training. The author proposed two ideas. Let us briefly study them to facilitate comparison with our own ideas later. Let \boldsymbol{s}\in\mathbb{R}^{l\times n} denote the Router output for a sequence of length l, and \boldsymbol{s}_i\in\mathbb{R}^n denote the Router output for the i-th token. The first idea proposed in the article is to subtract the exponential moving average (EMA) from \boldsymbol{s}: \hat{\boldsymbol{s}}_i = \boldsymbol{s}_i - \bar{\boldsymbol{s}}_i,\qquad \bar{\boldsymbol{s}}_i = \gamma\bar{\boldsymbol{s}}_{i-1} + (1 - \gamma)\boldsymbol{s}_i Then use \hat{\boldsymbol{s}} to decide which Experts to activate (e.g., select Top-k, or combine with QB, etc.). The idea behind this approach is very simple: intuitively, if Expert j is activated many times when using \boldsymbol{s} for decision-making, it means its scores \boldsymbol{s}_{:,j} are on average relatively large. Therefore, to achieve balance, it should be penalized accordingly, and the penalty amount is exactly the local average estimated by EMA.

In this way, in the new \hat{\boldsymbol{s}}, the scores of each Expert locally have approximately zero mean, preventing any Expert from being too “prominent”, thus achieving sequence-level balance. However, this is only an empirical approach and does not theoretically guarantee better balance. The author conducted simple tests, and in extremely imbalanced scenarios such as the first MoE layer, it did not bring substantial help. Perhaps the author also realized this limitation, so he proposed a second scheme.

Test-Time Training

The idea of the second scheme is also quite intuitive. In MoE Journey: 3. A Different Way to Allocate, didn’t we introduce a bias term to control global balance? This bias term is updated using SignSGD. In that case, we can imitate the “test-time training” idea from TTT. Taking \boldsymbol{s}\in\mathbb{R}^{l\times n} and the Top-k MoE as an example, the original Loss-Free scheme introduces a global bias vector \boldsymbol{\beta}\in\mathbb{R}^n, and then changes the activation mechanism to selecting Top-k from \boldsymbol{s}_i - \boldsymbol{\beta}. The idea here is to assign a \boldsymbol{\beta}_i\in\mathbb{R}^n to each \boldsymbol{s}_i, and change the activation mechanism to selecting Top-k from \boldsymbol{s}_i - \boldsymbol{\beta}_i. The update rule for \boldsymbol{\beta}_i can be signed gradient ascent: \boldsymbol{\beta}_i = \boldsymbol{\beta}_{i-1} + \eta\mathop{\text{sign}}(\boldsymbol{f}_{\leq i-1} - 1/n) Here \boldsymbol{f}_{\leq i-1} is the distribution of activated Experts up to the (i-1)-th token. The author’s final version made some adjustments: removed the sign function and changed \boldsymbol{f}_{\leq i-1} to \boldsymbol{f}_{i-1} (the activation distribution of the (i-1)-th token) to better achieve local balance: \boldsymbol{\beta}_i = \boldsymbol{\beta}_{i-1} + \eta(\boldsymbol{f}_{i-1} - 1/n) This idea can maintain the causality of the sequence and also ensure almost absolute sequence-level balance. However, since each step requires selecting Top-k and updating the bias term based on the result \boldsymbol{f}, this essentially introduces a nonlinear RNN. Nonlinear RNNs cannot be parallelized, so it is foreseeable that long sequences will become a bottleneck. Of course, we can also consider updating by chunks (Mini-batch TTT) to improve efficiency, but this becomes slightly less elegant.

The Optimal Solution

Next, the idea proposed in this article is called “Moving Quantile Balancing (MQB)”. As the name suggests, it evolves from “Quantile Balancing (QB)”. To this end, let us briefly review QB.

According to the order in which the author wrote, MoE Journey: 6. Optimal Allocation Promotes Balance came before MoE Journey: 7. A Minimalist Solution for Dynamic Activation, but in fact the latter is much simpler both conceptually and methodologically, so we will start from the latter. Given \boldsymbol{s}\in\mathbb{R}^{l\times n}, we introduce a bias term \boldsymbol{\beta}\in\mathbb{R}^n. In the static version of MoE, each token activates the k Experts with the largest \boldsymbol{s}_i - \boldsymbol{\beta}. Dynamic activation means activating all Experts for which \boldsymbol{s}_i - \boldsymbol{\beta} > 0, with an indefinite number.

To simultaneously ensure load balancing and control the average number of activations to k, an appropriate \boldsymbol{\beta} must be chosen. Remarkably, in this scenario, the optimal solution for \boldsymbol{\beta} can be expressed exactly: \boldsymbol{\beta} = \mathop{\text{desc\_sort}}(\boldsymbol{s}, \text{axis=0})_{[lk/n:lk/n+1]} = \mathop{\text{quantile}}(\boldsymbol{s}, 1-k/n, \text{axis=0}) That is, sort \boldsymbol{s} in descending order along the sequence dimension (here axis=0), and take the element at the lk/n-th position; this is the optimal solution for \boldsymbol{\beta}, which also corresponds to the “1-k/n quantile”.

In fact, this is just another way of expressing Expert Choice. We know that the problem with Expert Choice is that it breaks causality (non-causal), which can also be seen from the fact that it computes \boldsymbol{\beta} along the sequence dimension. If we are in an Encoder scenario, this non-causal scheme is naturally not a problem, but it is not acceptable for unidirectional autoregressive models.

Moving Quantile

The “test-time training” scheme by @ChangJonathanC earlier gave the author some inspiration. We can compute the bias term token by token along the sequence dimension based on QB. Since QB can directly express the optimal bias term based on \boldsymbol{s} without knowing the Expert activation status, this brings some possibilities for parallelization.

The author’s initial idea was \boldsymbol{\beta}_i = \mathop{\text{quantile}}(\boldsymbol{s}_{[:i]}, 1-k/n, \text{axis=0}), i.e., compute the bias term using the scores of all tokens up to the current position. Then each position has its own bias term, and the optimality of QB ensures balance up to the current position. Since it requires progressively computing quantiles along the sequence dimension, we may call this operation “Cumulative Quantile”, and the corresponding scheme “Cumulative Quantile Balancing (CQB)”.

CQB is feasible in principle, but since quantile cannot be updated incrementally, each step of CQB requires reading the entire data to compute the quantile, resulting in linearly increasing computational complexity and an overall quadratic complexity. To alleviate this problem and better reflect local balance, the author’s idea is to set a window w, and for each token, compute the optimal bias term only within the window of size w, i.e., \boldsymbol{\beta}_i = \mathop{\text{quantile}}(\boldsymbol{s}_{[i-w:i]}, 1-k/n, \text{axis=0}) Thus the operation changes from “Cumulative Quantile” to “Moving Quantile”, and the corresponding scheme becomes the prototype of “Moving Quantile Balancing (MQB)”. It is very similar to SWA (Sliding Window Attention), with linear complexity and theoretically parallelizable. The reason it is still a “prototype” is that quantile still cannot be updated incrementally, making the per-step complexity relatively high and still in need of improvement.

Bucket Estimation

The various obstacles mentioned in the previous section are essentially limited by the fact that “quantile is nonlinear and cannot be updated incrementally”. Is there a way to linearize it? Even approximately? Fortunately, it can indeed be done!

Schematic diagram of quantile estimation via histogram approximation

A key fact is: as long as we know the distribution of the variable, we can find the quantile via the cumulative probability, and the distribution can be updated incrementally! Therefore, the key here is to transform the estimation of quantiles into the estimation of the distribution. For this, we use bucket counting, also known as “histogram approximation”. First, assume that the Router scores \boldsymbol{s} are all within [0, 1], which can be ensured by adding an activation function such as Sigmoid. Next, divide [0, 1] into b equal buckets, discretize each value of \boldsymbol{s} into an integer, and then convert it into the corresponding one-hot vector.

In this way, we obtain a l\times n\times b array of 0/1. If we average it along the sequence dimension l, we can obtain the score distribution of each Expert (represented as a b-dimensional vector), and then compute the quantile for each Expert. However, to ensure causality, we cannot directly average over the entire sequence; we can only perform an operation similar to “Cumulative Average”. To avoid the trouble caused by the strict boundary of “Moving Quantile” and to make the whole process more “smooth”, we choose to use EMA here.

That is, after converting \boldsymbol{s} into bucketed one-hot vectors, we apply EMA along the sequence dimension. Each token obtains a local distribution, from which the corresponding 1-k/n quantile can be computed. This is the key variable \boldsymbol{\beta}_i of MQB. Since the EMA operation is linear and parallelizable, we have achieved sequence-level balance in a relatively efficient, Loss-Free manner. This is the final version of MQB.

\begin{array}{|l|} \hline \text{Moving Quantile Balancing (MQB)} \\[4pt] \hline \text{Input: score matrix }\boldsymbol{s}\in [0,1]^{l\times n} \\ \text{Output: corrected score matrix }\hat{\boldsymbol{s}}\in \mathbb{R}^{l\times n} \\[4pt] \hline \begin{array}{ll} 1: & \boldsymbol{h}_{i,j} = \mathop{\text{one\_hot}}(\lfloor s_{i,j} \times b\rfloor)\in\{0,1\}^b \\ 2: & \bar{\boldsymbol{h}}_{i,j} = \gamma\bar{\boldsymbol{h}}_{i-1,j} + (1-\gamma)\boldsymbol{h}_{i,j} \\ 3: & m_{i,j}^* = \min\{m \,|\, \sum_{t=0}^m \bar{h}_{i,j,t} \geq 1 - \frac{k}{n}\} \\ 4: & \beta_{i,j} =(m_{i,j}^* + 1/2) / b \\ 5: & \hat{\boldsymbol{s}}_i = \boldsymbol{s}_i - \boldsymbol{\beta}_i, \end{array} \\ \hline \end{array}

General Case

So far, our discussion has been based on the dynamic activation version of MoE, i.e., activating Experts with \boldsymbol{s}_i - \boldsymbol{\beta}_i > 0, with an indefinite number. From MoE Journey: 6. Optimal Allocation Promotes Balance, we know that for the Top-k version of MoE, \boldsymbol{\beta} does not have a simple analytical solution and requires alternating iterations.

Then how can the Top-k version of MoE achieve sequence-level load balancing in a Loss-Free manner? In fact, MQB can achieve sequence-level balance under dynamic activation, indicating that it has already “flattened” the local abnormal peaks of the Router. At this point, if we take Top-k from \boldsymbol{s}_i - \boldsymbol{\beta}_i, it may still be imbalanced, but this imbalance is at most at the global level. Therefore, we only need to add a step of global balancing QB to restore balance.

In other words, for the Top-k version of MoE, our enforced sequence-level balancing scheme is MQB+QB: apply another QB (or SignSGD, of course) on top of the \boldsymbol{s}_i - \boldsymbol{\beta}_i sequence to achieve sequence-level load balancing. However, it should be noted that perfect sequence-level load balancing often significantly harms performance. In general, global balance is sufficient, and sequence-level balance is only used to prevent extreme cases, so the degree of enforcement should not be too heavy.

To reduce the impact on the main task, the Aux Loss scheme can choose to lower the loss coefficient, while MQB can consider taking a weighted average of the scores before and after: \lambda(\boldsymbol{s}_i - \boldsymbol{\beta}_i) + (1-\lambda)\boldsymbol{s}_i = \boldsymbol{s}_i - \lambda \boldsymbol{\beta}_i,\qquad \lambda\in[0, 1] That is, by introducing \lambda < 1 to weaken the degree of sequence balance, and then superimposing QB on top to ensure global balance (since \lambda < 1 cannot guarantee balance, at this time both the dynamic version and the Top-k version need to add QB).

Other Details

After bucketing, the core operation of MQB also becomes EMA, so after going around in circles, we are back to EMA, similar in form to the first scheme proposed by @ChangJonathanC. The difference lies in the object of EMA. The scheme by @ChangJonathanC directly applies EMA to the original scores of size l\times n, while MQB expands it to l\times n\times b before applying EMA.

This change is somewhat like how linear attention expands capacity through outer products, allowing more information to be remembered. Moreover, the optimality of QB ensures that it can achieve sequence-level balance, no longer remaining an empirical approach. Another similarity to linear attention is that EMA is essentially an RNN, so during inference, an additional state vector of size n\times b needs to be stored. In practice, it is found that b=100 already yields good results, so this state size should be acceptable.

It should also be pointed out that whether it is QB or MQB, they are only used for the activation decision of Experts. The final multiplication by the Expert gating function is still based on the original Router scores before subtracting the bias term. This is a common feature of all Loss-Free schemes, ensuring that the balancing intervention does not directly change the form of MoE, in order to minimize the impact on the main model.

Finally, regarding the bucketing method, the author’s current experiments use Sigmoid activation followed by uniform discretization in [0,1] for bucketing. How to perform finer bucketing may still have some experimental room, which is left to interested readers to complete.

Experimental Results

For MQB, the author conducted some simple experiments. The configuration is a MoE model with a total parameter count of about 3B, selecting 4 out of 128 Experts. The EMA coefficient of MQB is 0.99, the default number of buckets is 100, and MaxVio is used as the load balancing metric. Since the first MoE layer is the most imbalanced, we will only demonstrate the results of the first layer below.

First, the full version of MQB with \lambda=1:

Comparison of MQB (\lambda=1), QB, and SignSGD

It can be seen that \lambda=1 achieves very perfect load balancing, at the cost of a Loss difference of 0.06, which is a very significant drop, indicating that excessive sequence balance is detrimental to performance. However, if \lambda is adjusted to around 0.3, the Loss basically does not drop, and it can still promote load balancing, as shown in the figure below.

Load balancing for different \lambda values

Extended Thoughts

From the above experiments, we can further confirm that perfect sequence-level balance (\lambda=1) severely damages performance. If we take \lambda=0.3, we can roughly ensure no loss in performance while improving the balance of each layer.

In summary, this article only answers the question of “how to achieve sequence-level load balancing in a Loss-Free manner”. However, whether sequence-level load balancing is needed, why it is needed, and to what degree it is needed, we cannot provide a standard answer. A preliminary understanding is that extreme imbalance at the sequence level may affect the performance of the sequence itself (collapsing into a small model), so we hope to encourage a certain degree of sequence-level balance.

In addition to MQB, there are actually some other schemes that can also achieve sequence-level balance in a Loss-Free manner. For example, the Hash Routing introduced in How did DeepSeek V4’s tid2eid come about? can also be considered a simple scheme for enforcing sequence balance. However, Hash Routing has redefined the routing form and belongs to an unconventional route.

The value of a scheme like MQB lies in its better compatibility with conventional MoE forms, and it transforms “adjusting the coefficient of Aux Loss” into “adjusting an interpretable local bias term”, allowing us to control the intervention strength more intuitively, while its parallelizability ensures efficiency. Since it only affects Router decisions without injecting gradients, its impact on the main task is also minimized as much as possible.

Conclusion

This article revolves around “how to achieve sequence-level load balancing in a Loss-Free manner”. Starting from the original Quantile Balancing (QB), we gradually derived a method called Moving Quantile Balancing (MQB), successfully achieving this goal. However, whether sequence-level balance is truly necessary and to what extent it should be enforced remains an open question.

For reprinting, please include the address of this article: https://kexue.fm/archives/11760

For more detailed reprinting matters, please refer to: Science Space FAQ