In this article, we continue to explore the load balancing problem of MoE (Mixture of Experts). In the previous article "MoE Tour: 2. Not Scarcity but Inequality", we mainly discussed the idea of promoting load balancing through Auxiliary Loss (Aux Loss). While Aux Loss is simple and intuitive, it has a significant drawback: the weight is difficult to tune. If set too low, it fails to promote balance; if set too high, it easily damages the Language Model (LM) Loss. Therefore, the industry has been searching for alternative solutions.
The solution we are sharing today is called "Loss-Free," proposed by DeepSeek in "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts". Compared to DeepSeek’s many dazzling open-source works, this paper might not seem as prominent, but in my view its potential academic influence may well exceed that of those works. The proposed method is not only simple and effective but also highly universal, making it a classic.
Main Idea
Faced with load imbalance, the approach of Aux Loss is to guide the Router to give balanced scores through an additional loss. In contrast, the idea of Loss-Free is to change the allocation logic itself. That is, instead of changing the Router’s existing scoring results, it changes the allocation method \mathop{\text{argtop}}_k \boldsymbol{\rho}.
Actually, there have been previous efforts in this direction. For example, in 2021, Facebook proposed the BASE Layer, which treats Expert allocation as a linear assignment problem. This involves finding an allocation that maximizes the total Router score under the constraint of load balancing, which can be solved using the Hungarian algorithm. However, this scheme requires knowing the scores of all tokens, so for auto-regressive LLMs, it is only applicable during training. Inference still has to use \mathop{\text{argtop}}_k \boldsymbol{\rho}, leading to an inconsistency between training and inference. Furthermore, due to current algorithm limitations, it only works for the k=1 scenario.
In comparison, the Loss-Free approach is very simple and effective. It notes the fact that we can always introduce a bias term \boldsymbol{b} such that the allocation \mathop{\text{argtop}}_k (\boldsymbol{\rho} + \boldsymbol{b}) is balanced. Thus, it modifies the MoE form as follows: \begin{equation} \boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho}} \rho_i \boldsymbol{e}_i \qquad\to\qquad \boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k (\boldsymbol{\rho} + \boldsymbol{b})} \rho_i \boldsymbol{e}_i \end{equation} Here, \boldsymbol{b} is an input-independent vector determined during the training process. Once training is complete, it remains fixed, so it can also be used during the inference stage. In other words, training and inference share a consistent form. Note that the term multiplied by \boldsymbol{e}_i is still \rho_i, not \rho_i + b_i. This means \boldsymbol{b} only participates in the allocation process and not in the forward calculation of the MoE. Therefore, we have no special requirements for the sign of \boldsymbol{b} or \boldsymbol{\rho} + \boldsymbol{b}.
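As a concrete illustration, here is a minimal NumPy sketch of this routing rule for a single token (function and variable names are my own, not from the paper). The key point is that the bias enters only the top-k selection, while the weight multiplied with each selected expert remains the original \rho_i:

```python
import numpy as np

def loss_free_moe(x, rho, b, experts, k=2):
    """Forward pass for one token under the Loss-Free routing rule.

    rho:     (n,) router scores for this token
    b:       (n,) input-independent routing bias
    experts: list of n callables, each mapping x to an output vector
    """
    # The bias enters ONLY the top-k selection ...
    topk = np.argsort(rho + b)[-k:]
    # ... while the gate multiplied with each expert stays the original rho_i.
    return sum(rho[i] * experts[i](x) for i in topk)
```

So changing \boldsymbol{b} can change which experts fire, but never the weight an expert receives once it is selected, which is exactly why no special sign constraint on \boldsymbol{b} or \boldsymbol{\rho}+\boldsymbol{b} is needed.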
Hand-Crafting the Gradient
How do we train \boldsymbol{b}? We know that the optimization direction for \boldsymbol{b} is naturally to promote load balancing. To this end, following the notation of the previous article, we first define \boldsymbol{f}=[f_1, f_2, \dots, f_n]: \begin{equation} f_i = \left\{\begin{aligned} 1/k, & \quad i \in \mathop{\text{argtop}}\nolimits_k (\boldsymbol{\rho} + \boldsymbol{b}) \\ 0, & \quad i \notin \mathop{\text{argtop}}\nolimits_k (\boldsymbol{\rho} + \boldsymbol{b}) \end{aligned}\right. \end{equation} And let \boldsymbol{F} = \mathbb{E}[\boldsymbol{f}]. Here, \boldsymbol{F} is naturally the current load distribution of Experts under the bias \boldsymbol{b}. If we define the uniform distribution as \boldsymbol{Q} = (1/n, 1/n, \dots, 1/n), then load balancing is equivalent to minimizing: \begin{equation} \mathcal{L}_{\text{aux}} = \frac{1}{2}\Vert\boldsymbol{F} - \boldsymbol{Q}\Vert^2 = \frac{1}{2}\sum_{i=1}^n (F_i - 1/n)^2 \end{equation} This objective is non-differentiable. However, with the experience from the previous article, we know that the Straight-Through Estimator (STE) can solve this. The key to STE is to find a differentiable quantity that has the same trend as \boldsymbol{F} to serve as a smooth approximation. Here, our only optimization parameter is \boldsymbol{b}, and it happens to have the desired property (increasing b_i increases the probability of i being selected, thus increasing F_i). 
So the answer is clear: \begin{equation} \mathcal{L}_{\text{aux}} = \frac{1}{2}\Vert\boldsymbol{b} + \text{sg}[\boldsymbol{F}-\boldsymbol{b}] - \boldsymbol{Q}\Vert^2 = \frac{1}{2}\sum_{i=1}^n (b_i + \text{sg}[F_i - b_i] - 1/n)^2 \end{equation} Its gradient is: \begin{equation} \nabla_{\boldsymbol{b}}\mathcal{L}_{\text{aux}} = \frac{1}{2}\nabla_{\boldsymbol{b}}\Vert\boldsymbol{b} + \text{sg}[\boldsymbol{F}-\boldsymbol{b}] - \boldsymbol{Q}\Vert^2 = \boldsymbol{F} - \boldsymbol{Q} \end{equation} So, using Stochastic Gradient Descent (SGD) to update \boldsymbol{b} gives: \begin{equation} \boldsymbol{b} \leftarrow \boldsymbol{b} - \alpha (\boldsymbol{F} - \boldsymbol{Q}) \end{equation} where \alpha is the learning rate for \boldsymbol{b}. However, the update rule finally chosen by Loss-Free is slightly different; it chooses Sign Gradient Descent (SignSGD): \begin{equation} \boldsymbol{b} \leftarrow \boldsymbol{b} - \alpha \mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q}) \label{eq:aux-loss-free} \end{equation} This result is also easy to understand: if F_i is larger than 1/n, then decrease b_i slightly; otherwise, increase b_i slightly.
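The two steps above, measuring the load \boldsymbol{F} over a batch and then applying the sign update, can be sketched as follows (a toy NumPy sketch with hypothetical names, assuming a batch of per-token scores and per-token top-k routing):

```python
import numpy as np

def expert_load(scores, b, k):
    """Estimate F = E[f]: the fraction of routing weight each expert receives.

    scores: (num_tokens, n) router scores rho for a batch
    b:      (n,) routing bias
    """
    num_tokens, n = scores.shape
    topk = np.argsort(scores + b, axis=1)[:, -k:]  # biased top-k per token
    counts = np.bincount(topk.ravel(), minlength=n)
    return counts / (num_tokens * k)  # each selected slot contributes 1/k

def update_bias(b, F, alpha=0.001):
    """One SignSGD step on b: push over-loaded experts down, under-loaded up."""
    Q = 1.0 / len(b)                  # uniform target load
    return b - alpha * np.sign(F - Q)
```

Note that F always sums to 1, so over-loaded and under-loaded experts necessarily coexist until balance is reached.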
Improved Version
In addition to SignSGD, I found that applying RMS Norm to \boldsymbol{F} - \boldsymbol{Q} (i.e., Normalized SGD) often achieves better balancing effects under the same \alpha: \begin{equation} \boldsymbol{b} \leftarrow \boldsymbol{b} - \alpha \frac{\boldsymbol{F} - \boldsymbol{Q}}{\text{RMS}(\boldsymbol{F} - \boldsymbol{Q})} \end{equation} where RMS stands for "Root Mean Square," defined as: \begin{equation} \text{RMS}(\boldsymbol{F} - \boldsymbol{Q}) = \sqrt{\frac{1}{n}\sum_{i=1}^n (F_i - Q_i)^2} \end{equation} It is easy to see that both \mathop{\text{sign}}(\boldsymbol{F} - \boldsymbol{Q}) and \frac{\boldsymbol{F} - \boldsymbol{Q}}{\text{RMS}(\boldsymbol{F} - \boldsymbol{Q})} have an RMS of 1. Therefore, they are roughly on the same scale, allowing us to use the same \alpha.
Simply put, the problem with \mathop{\text{sign}} is that it applies the same update magnitude regardless of how far F_i is from the target Q_i. As a result, even F_i values that are already close to Q_i keep receiving full-size updates, which pushes them away from the balance they have reached and causes oscillation. RMS Norm, by contrast, preserves the relative differences between the F_i - Q_i, making the update magnitude adaptive. Theoretically this is more conducive to promoting balance, and empirical tests often show it performs better.
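A minimal sketch of this RMS-normalized variant (naming is my own; the eps guard against division by zero at exact balance is an implementation detail I added):

```python
import numpy as np

def update_bias_rms(b, F, alpha=0.001, eps=1e-12):
    """One Normalized-SGD step on b: the raw gradient F - Q divided by its RMS.

    The eps guard (my addition) avoids dividing by zero at exact balance.
    """
    Q = 1.0 / len(b)            # uniform target load
    g = F - Q                   # raw gradient of the balancing objective
    rms = np.sqrt(np.mean(g ** 2)) + eps
    return b - alpha * g / rms
```

Since the normalized gradient has RMS 1 by construction, the overall step size matches that of the SignSGD rule, so the same \alpha can be reused.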
Relevant Details
Although Loss-Free is simple and clear, there are some details to note when using it.
First, for each batch of data, we should first update the model parameters based on the LM Loss, and then update \boldsymbol{b} according to Eq. [eq:aux-loss-free]. This is because the update of \boldsymbol{b} depends on the statistical information \boldsymbol{F} of all tokens. Updating \boldsymbol{b} before the rest of the model parameters could, in principle, risk leaking future information. Although a single vector \boldsymbol{b} likely won’t leak much information, the risk exists, so it should be avoided.
Second, the original paper settled on \alpha=0.001, but this choice might be tied to using Sigmoid as the activation function for the Router \boldsymbol{\rho}. After Sigmoid, each \rho_i is relatively independent and lies within (0,1), so \alpha=0.001 means the update magnitude per step is about one-thousandth of that scale. If Softmax, ReLU, or other activation functions are used, \alpha might need to be retuned.
To address this, I suggest decoupling the activation functions used for the Gate and the Bias: \begin{equation} \boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k (\boldsymbol{\rho}^{(\sigma)} + \boldsymbol{b})} \rho_i^{(h)} \boldsymbol{e}_i \end{equation} where \boldsymbol{\rho}^{(\sigma)} = \sigma(\boldsymbol{x}\boldsymbol{W}^{(R)}) and \boldsymbol{\rho}^{(h)} = h(\boldsymbol{x}\boldsymbol{W}^{(R)}). Here \sigma(\cdot) is the Sigmoid function, and h(\cdot) is any monotonic function with a non-negative range. Simply put, the scores added to \boldsymbol{b} are Sigmoid-activated, allowing us to reuse \alpha=0.001. For the Gate multiplied by the Expert, we can use other activation functions, as long as their monotonicity is consistent with Sigmoid.
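A sketch of this decoupling, with h = \exp chosen purely as an example of a monotonically increasing, non-negative gate activation (all names here are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def route_decoupled(z, b, k=2, h=np.exp):
    """Decoupled routing/gating: select by sigmoid(z) + b, gate by h(z).

    z: (n,) raw router logits x @ W_R
    h: any monotonically increasing function with non-negative range;
       exp is used here purely as an example.
    """
    # Selection uses the sigmoid scale, so alpha = 0.001 can be reused.
    topk = np.argsort(sigmoid(z) + b)[-k:]
    # Gating uses h(z); monotonicity keeps the expert ranking consistent.
    return topk, h(z)[topk]
```

Because h is monotonic in the same direction as Sigmoid, the relative ordering of experts is unchanged; only the gate magnitudes differ.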
Furthermore, because the update rule [eq:aux-loss-free] uses the \text{sign} function, it is possible to train b_i values with absolute values greater than 1, and the overall magnitude might keep increasing. This is normal and will not affect the model’s performance. In fact, \boldsymbol{b} has a redundant degree of freedom because adding the same constant to all b_i does not change the result of \mathop{\text{argtop}}_k (\boldsymbol{\rho} + \boldsymbol{b}). This extra degree of freedom can be used for other interesting things (to be discussed next time).
Extended Thinking
Beyond MoE load balancing, the idea of Loss-Free can be applied to many similar problems. For example, Codebook Collapse in VQ-VAE can be solved with the same logic, which appears more natural and universal than previously introduced "rotation tricks" or "linear transformation tricks." In fact, the evaluation at the beginning of this article—that "Loss-Free’s potential academic influence may far exceed other works"—is based on this universality.
Setting aside specific application backgrounds, from a mathematical perspective, the contribution of Loss-Free can be understood as providing a method to solve assignment problems using gradient descent. A classic linear assignment problem can be expressed as: \begin{equation} \min_f \sum_{i=1}^n c_{i, f(i)} \end{equation} where c_{i,j} is a given cost function and f is a bijection from \{1, 2, \dots, n\} to itself. In the context of this article, c_{i,j} corresponds to the scores of n tokens and n experts, and the sought f is a load-balanced allocation scheme. The general idea for solving such problems is to search for the optimal solution within the space of assignments that satisfy the constraints. Loss-Free, however, does the opposite: it first constructs an optimal solution that may not satisfy the constraints: \begin{equation} f(i) = \mathop{\text{argmin}}_j c_{i,j} \end{equation} This solution is certainly optimal in terms of scores but may not satisfy the bijection condition (which is equivalent to load imbalance). Thus, we introduce a bias: \begin{equation} f(i) = \mathop{\text{argmin}}_j (c_{i,j} + b_j) \end{equation} \boldsymbol{b} is initialized to zero and then updated by a sign rule analogous to Eq. [eq:aux-loss-free]. Since we are now taking an \text{argmin}, the directions flip: if a certain j appears too frequently, increase the corresponding b_j to make it less attractive; otherwise, decrease it, until a bijection is achieved.
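To make this concrete, here is a toy sketch that solves a tiny assignment problem with exactly this bias iteration. It is a heuristic illustration of the idea, not the Hungarian algorithm, and it is not guaranteed to terminate within a fixed step budget for arbitrary cost matrices (ties in particular can cause oscillation):

```python
import numpy as np

def assign_with_bias(c, alpha=0.1, steps=1000):
    """Drive f(i) = argmin_j (c[i, j] + b[j]) toward a bijection.

    c: (n, n) cost matrix. Repeatedly take the unconstrained row-wise
    optimum and nudge b until every column is chosen exactly once.
    """
    n = c.shape[0]
    b = np.zeros(n)
    f = np.argmin(c, axis=1)
    for _ in range(steps):
        f = np.argmin(c + b, axis=1)          # unconstrained optimum under bias
        counts = np.bincount(f, minlength=n)  # how often each column is used
        if np.all(counts == 1):               # bijection reached
            break
        # Over-used columns get a larger bias (less attractive under argmin),
        # under-used columns a smaller one -- the sign rule of Loss-Free.
        b += alpha * np.sign(counts - 1)
    return f
```

On a cost matrix where both rows prefer the same column, the iteration gradually penalizes that column until the row with the smaller switching cost moves away first, yielding a valid bijection.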
Summary
This article introduced the Loss-Free method for MoE load balancing, proposed by DeepSeek. Its core lies in achieving load balancing by introducing a simple bias term. This article further explored its connection with Aux Loss and its potential application in similar mathematical problems.
When reprinting, please include the original address: https://kexue.fm/archives/10757
For more details on reprinting, please refer to: "Scientific Space FAQ"