In the previous two articles, we discussed load balancing. In "MoE Tour: 3. A Different Approach to Allocation", while introducing the Loss-Free scheme, I left a thread hanging: the newly introduced Bias term has a redundant degree of freedom, which can be put to other interesting uses. In this article, we discuss exactly that.
We know that MoE selects only the k best-matching Experts for each Token, increasing the parameter count while keeping computation low. On closer inspection, however, this strategy has obvious room for improvement: intuitively, Tokens differ in difficulty, so a more reasonable scheme would allocate more compute to difficult Tokens and less to simple ones, which should get more out of the same limited resources.
The extra degree of freedom in the Bias mentioned above happens to offer a simple way to achieve this goal.
Design Philosophy
First, let’s review the basic form of MoE:
\begin{equation} \boldsymbol{y} = \sum_{i\in \mathop{\mathrm{argtop}}_k \boldsymbol{\rho}} \rho_i \boldsymbol{e}_i \end{equation}
where \boldsymbol{\rho} is the Router's score vector and \boldsymbol{e}_i is the output of the i-th Expert. Load imbalance is a common problem in MoE training. In response, researchers proposed the Aux Loss, which we introduced in "MoE Tour: 2. Not Worried about Scarcity but about Inequality". In addition, in "MoE Tour: 3. A Different Approach to Allocation", we introduced the Loss-Free scheme proposed by DeepSeek, which changes MoE to:
\begin{equation} \boldsymbol{y} = \sum_{i\in \mathop{\mathrm{argtop}}_k (\boldsymbol{\rho} + \boldsymbol{b})} \rho_i \boldsymbol{e}_i \end{equation}
and achieves load balancing by adjusting the newly introduced Bias term \boldsymbol{b}. To allow each Token to select a dynamic number of Experts, the approach I propose is a slight modification of the Loss-Free form:
\begin{equation} \boldsymbol{y} = \sum_{i\in \mathop{\mathrm{argwhere}}(\boldsymbol{\rho} + \boldsymbol{b} > 0)} \rho_i \boldsymbol{e}_i \end{equation}
That is, an Expert is selected as long as it satisfies \rho_i + b_i > 0. In this way, the number of Experts selected for each Token is naturally dynamic, and the need for sorting is eliminated, which in some sense makes things even simpler.
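To make this concrete, here is a minimal numpy sketch of the forward selection for a single Token (the function name, array shapes, and the assumption that \boldsymbol{\rho} comes from Sigmoid-activated Logits are my own illustrative choices, not part of the original scheme):

import numpy as np

def dynamic_moe_output(rho, b, expert_outputs):
    # rho: (n,) Router scores of one Token (e.g. Sigmoid-activated Logits)
    # b: (n,) Bias; expert_outputs: (n, d) outputs e_i of the n Experts
    selected = (rho + b) > 0  # every Expert with rho_i + b_i > 0 is used
    # the combination weights stay rho_i; b only decides which Experts participate
    return (rho[selected][:, None] * expert_outputs[selected]).sum(axis=0)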
Optimization Objectives
There are two optimization objectives for \boldsymbol{b}: one is to achieve load balancing, just as in Loss-Free; the other is to keep the average number of Experts selected per Token at k, which we may call budget control. Without the latter, simply setting b_i = \infty would select all Experts, which is not what we want.
Load balancing still adopts the training method of Loss-Free. Define the notation \boldsymbol{f} = [f_1, f_2, \cdots, f_n]:
\begin{equation} f_i = \left\{\begin{aligned}1, \quad \rho_i + b_i > 0 \\ 0, \quad \rho_i + b_i \leq 0\end{aligned}\right. \end{equation}
Then let \tilde{\boldsymbol{F}}=\mathbb{E}[\boldsymbol{f}] (the average of \boldsymbol{f} over Tokens), so that \boldsymbol{F} = \tilde{\boldsymbol{F}}/|\tilde{\boldsymbol{F}}| is the current Expert distribution, where |\tilde{\boldsymbol{F}}| denotes the sum of the components of \tilde{\boldsymbol{F}}. The update rule proposed by Loss-Free is:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q})\label{eq:aux-loss-free} \end{equation}
where \boldsymbol{Q}=(1/n, 1/n, \cdots, 1/n) is the target uniform distribution.
As mentioned several times, \boldsymbol{b} has a redundant degree of freedom, reflected in the fact that adding the same constant to all components of \boldsymbol{b} does not change the sorting (Top-k) result. We can therefore change the update rule [eq:aux-loss-free] to:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \left[\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q}) - \overline{\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q})}\right]\label{eq:aux-loss-free-2} \end{equation}
Here, a bar over a vector denotes the mean of its components, which is a scalar, and subtracting a scalar from a vector means subtracting it from every component. With this rule, the balancing term never changes \overline{\boldsymbol{b}}, yet the load balancing effect is unaffected. We can thus reserve the degree of freedom \overline{\boldsymbol{b}} for budget control.
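As a rough numpy sketch of one step of equation [eq:aux-loss-free-2], continuing the sketch above (the function name and the convention that `scores` holds the Router scores of a batch of Tokens are assumptions on my part):

def update_bias_balanced(b, scores, alpha):
    # scores: (num_tokens, n) Router scores rho for a batch of Tokens; b: (n,) Bias
    f = ((scores + b) > 0).astype(float)  # f_i for every Token
    F_tilde = f.mean(axis=0)              # \tilde{F} = E[f]
    F = F_tilde / F_tilde.sum()           # current Expert distribution
    Q = np.ones_like(F) / len(F)          # target uniform distribution Q
    g = np.sign(F - Q)
    return b - alpha * (g - g.mean())     # centered update: the mean of b is untouched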
How should we understand this? Clearly, if a positive constant is added to every b_i, the probability of satisfying \rho_i + b_i > 0 increases, and with it the total budget. So the approach is simple: first compute the current average budget, which is exactly |\tilde{\boldsymbol{F}}|; if it is greater than k, decrease \boldsymbol{b} slightly, otherwise increase it. Folding this into equation [eq:aux-loss-free-2] gives:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \left[\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q}) - \overline{\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q})} + \mathop{\mathrm{sign}}(|\tilde{\boldsymbol{F}}|- k)\right]\label{eq:aux-loss-free-3} \end{equation}
If we only want the budget not to exceed k, rather than to equal k exactly, we can simply make no change when |\tilde{\boldsymbol{F}}| < k:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \left[\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q}) - \overline{\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q})} + \mathop{\mathrm{sign}}(\max(|\tilde{\boldsymbol{F}}|- k,0))\right]\label{eq:aux-loss-free-4} \end{equation}
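Extending the same sketch to equation [eq:aux-loss-free-3] only adds the budget term; the [eq:aux-loss-free-4] variant is noted in a comment (again, the names are illustrative):

def update_bias_budgeted(b, scores, alpha, k):
    f = ((scores + b) > 0).astype(float)
    F_tilde = f.mean(axis=0)              # \tilde{F}
    budget = F_tilde.sum()                # |\tilde{F}|: average number of selected Experts
    F = F_tilde / budget
    Q = np.ones_like(F) / len(F)
    g = np.sign(F - Q)
    # for equation [eq:aux-loss-free-4], replace np.sign(budget - k) with np.sign(max(budget - k, 0))
    return b - alpha * (g - g.mean() + np.sign(budget - k))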
Attempting Simplification
Looking more closely at equation [eq:aux-loss-free-3], we see that it does two things: it pushes \boldsymbol{F}=\tilde{\boldsymbol{F}}/|\tilde{\boldsymbol{F}}| toward \boldsymbol{Q}, and it pushes |\tilde{\boldsymbol{F}}| toward k. These can seemingly be merged into a single goal: pushing \tilde{\boldsymbol{F}} toward \tilde{\boldsymbol{Q}}=k\boldsymbol{Q}=(k/n,k/n,\cdots,k/n). Equation [eq:aux-loss-free-3] can thus be simplified to:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \mathop{\mathrm{sign}}(\tilde{\boldsymbol{F}} - \tilde{\boldsymbol{Q}})\label{eq:aux-loss-free-5} \end{equation}
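Under the same assumptions as the sketches above, the simplified rule becomes:

def update_bias_simplified(b, scores, alpha, k):
    F_tilde = ((scores + b) > 0).mean(axis=0)   # \tilde{F}
    Q_tilde = k / len(F_tilde)                  # every component of \tilde{Q} equals k/n
    return b - alpha * np.sign(F_tilde - Q_tilde)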
I experimented with both equation [eq:aux-loss-free-3] and equation [eq:aux-loss-free-5] and found their effects to be largely similar. However, the load-balancing and budget-control indicators fluctuate much more under equation [eq:aux-loss-free-5] in the early stages of training. Readers who prioritize stability can therefore prefer equation [eq:aux-loss-free-3] or [eq:aux-loss-free-4], while readers who prefer simplicity can consider equation [eq:aux-loss-free-5].
Considering that \mathop{\mathrm{sign}} keeps only the sign of \tilde{F}_i - \tilde{Q}_i and discards its magnitude, I also tried replacing \mathop{\mathrm{sign}} with an RMS Norm:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha (\tilde{\boldsymbol{F}} - \tilde{\boldsymbol{Q}})/\Vert\tilde{\boldsymbol{F}} - \tilde{\boldsymbol{Q}}\Vert_{RMS} \end{equation}
where \Vert\cdot\Vert_{RMS} of a vector is the square root of the mean of the squares of its components. Clearly, the RMS of a \mathop{\mathrm{sign}} vector is 1, and a vector after RMS Norm also has RMS 1, so the two updates have the same magnitude and can share the same \alpha. Since RMS Norm preserves the relative sizes of \tilde{F}_i - \tilde{Q}_i, smaller errors yield smaller updates, so it is slightly less volatile than \mathop{\mathrm{sign}}, though not by much.
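The corresponding sketch, with the sign replaced by an RMS-normalized error (the small epsilon guarding against division by zero is my own addition, not part of the formula):

def update_bias_rms(b, scores, alpha, k):
    F_tilde = ((scores + b) > 0).mean(axis=0)
    err = F_tilde - k / len(F_tilde)      # \tilde{F} - \tilde{Q}
    rms = np.sqrt((err ** 2).mean())      # RMS of the error vector
    return b - alpha * err / (rms + 1e-12)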
Of course, using RMS Norm to replace \mathop{\mathrm{sign}} to increase stability is a general trick. Equations [eq:aux-loss-free], [eq:aux-loss-free-2], [eq:aux-loss-free-3], or [eq:aux-loss-free-4] can all undergo such a replacement. This depends on personal preference; in short, it is slightly more stable but not significantly so.
Initialization Method
After solving the update rule for \boldsymbol{b}, let’s consider the initialization of \boldsymbol{b}, which is an interesting but not critical issue.
According to conventional practice, \boldsymbol{b} would be initialized to all zeros; if \boldsymbol{\rho} uses a Sigmoid activation, then in the initial stage all n Experts will be selected, which clearly exceeds the budget of \leq k and will cause many Token Drops. If we are not too fussy about this, it is not a serious problem: the other model parameters usually have a learning-rate Warmup while \boldsymbol{b} does not, so the model will automatically correct it within the first few Warmup steps.
If we do mind this, we can control the initial budget by adjusting the initialization of \boldsymbol{b}. Assuming the input to the Router is a d-dimensional vector satisfying zero mean and unit variance (which approximately holds with RMSNorm), and the Router’s weight initialization variance is \sigma^2, then the Router’s Logits are approximately zero mean with variance \sigma^2 d. With this data, we can use a normal approximation simulation plus a bisection method to estimate an initial \boldsymbol{b}:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def b_init(n, k, d, sigma, eps=0.1):
    """Estimate a uniform initial Bias so that roughly k of the n Experts are selected."""
    b1, b2 = -1, 0  # search interval for the Bias under Sigmoid scores
    std = sigma * d**0.5  # standard deviation of the Router's Logits
    logits = np.random.randn(10000, n) * std  # normal approximation of the Logits
    scores = sigmoid(logits)
    while True:  # bisection on the average number of selected Experts
        b = (b1 + b2) * 0.5
        c = ((scores + b) > 0).sum(1).mean()  # average number of Experts with rho + b > 0
        if -eps < c - k < eps:
            return b
        elif c > k:  # too many Experts selected: make the Bias more negative
            b2 = b
        else:
            b1 = b

b_init(32, 4, 1024, 6e-3)

The code assumes Sigmoid activation, so the search interval is [-1, 0]; if you use a different activation function, adjust it accordingly. That said, the suggestion here is the same as in "MoE Tour: 3. A Different Approach to Allocation": the \boldsymbol{\rho} to which \boldsymbol{b} is added can uniformly use a Sigmoid activation, while the \boldsymbol{\rho} that multiplies the Expert outputs can use other activation functions.
Summary
This article proposes an MoE design that dynamically selects the number of Experts. The main idea is to slightly modify the Loss-Free MoE form and then adjust the update rule of the Bias term, utilizing its extra degree of freedom to simultaneously achieve load balancing and budget control.