In the previous two articles, we discussed load balancing. In "MoE Tour: 3. A Different Approach to Allocation", while introducing the Loss-Free scheme, I left a thread hanging: the newly introduced Bias term has a redundant degree of freedom, which can be put to other interesting uses. In this article, we discuss exactly that.
We know that MoE selects only the k best-matching Experts for each Token, increasing the parameter count while keeping computation low. On closer inspection, however, this strategy has obvious room for improvement: intuitively, Tokens differ in difficulty, so a more reasonable scheme would allocate more compute to difficult Tokens and less to simple ones, which should get more out of the same limited resources.
The extra degree of freedom in the Bias mentioned above happens to offer a simple way to achieve this goal.
Design Philosophy
First, let’s review the basic form of MoE:
\begin{equation} \boldsymbol{y} = \sum_{i\in \mathop{\mathrm{argtop}}_k \boldsymbol{\rho}} \rho_i \boldsymbol{e}_i \end{equation}
where \boldsymbol{\rho} is the Router's score vector and \boldsymbol{e}_i is the output of the i-th Expert. Load imbalance is a common problem in MoE training. In response, researchers proposed the Aux Loss, which we introduced in "MoE Tour: 2. Not Worried about Scarcity but about Inequality". In addition, in "MoE Tour: 3. A Different Approach to Allocation", we introduced the Loss-Free scheme proposed by DeepSeek, which changes MoE to:
\begin{equation} \boldsymbol{y} = \sum_{i\in \mathop{\mathrm{argtop}}_k (\boldsymbol{\rho} + \boldsymbol{b})} \rho_i \boldsymbol{e}_i \end{equation}
and achieves load balancing by adjusting the newly introduced Bias term \boldsymbol{b}. To allow each Token to select a dynamic number of Experts, the approach I propose is a slight modification of the Loss-Free form:
\begin{equation} \boldsymbol{y} = \sum_{i\in \mathop{\mathrm{argwhere}}(\boldsymbol{\rho} + \boldsymbol{b} > 0)} \rho_i \boldsymbol{e}_i \end{equation}
That is, an Expert is selected as long as it satisfies \rho_i + b_i > 0. In this way, the number of Experts selected for each Token is naturally dynamic, and the need for sorting is eliminated, which in some sense makes things even simpler.
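To make this concrete, here is a minimal numpy sketch of the forward selection for a single Token (the function name, array shapes, and the assumption that \boldsymbol{\rho} comes from Sigmoid-activated Logits are my own illustrative choices, not part of the original scheme):

import numpy as np

def dynamic_moe_output(rho, b, expert_outputs):
    # rho: (n,) Router scores of one Token (e.g. Sigmoid-activated Logits)
    # b: (n,) Bias; expert_outputs: (n, d) outputs e_i of the n Experts
    selected = (rho + b) > 0  # every Expert with rho_i + b_i > 0 is used
    # the combination weights stay rho_i; b only decides which Experts participate
    return (rho[selected][:, None] * expert_outputs[selected]).sum(axis=0)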
Optimization Objectives
There are two optimization objectives for \boldsymbol{b}: one is to achieve load balancing, just as in Loss-Free; the other is to keep the average number of Experts selected per Token at k, which we may call budget control. Without the latter, simply setting b_i = \infty would select all Experts, which is not what we want.
Load balancing still adopts the training method of Loss-Free. Define the notation \boldsymbol{f} = [f_1, f_2, \cdots, f_n]:
\begin{equation} f_i = \left\{\begin{aligned}1, \quad \rho_i + b_i > 0 \\ 0, \quad \rho_i + b_i \leq 0\end{aligned}\right. \end{equation}
Then let \tilde{\boldsymbol{F}}=\mathbb{E}[\boldsymbol{f}] (the average of \boldsymbol{f} over Tokens), so that \boldsymbol{F} = \tilde{\boldsymbol{F}}/|\tilde{\boldsymbol{F}}| is the current Expert distribution, where |\tilde{\boldsymbol{F}}| denotes the sum of the components of \tilde{\boldsymbol{F}}. The update rule proposed by Loss-Free is:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q})\label{eq:aux-loss-free} \end{equation}
where \boldsymbol{Q}=(1/n, 1/n, \cdots, 1/n) is the target uniform distribution.
As mentioned several times, \boldsymbol{b} has a redundant degree of freedom, reflected in the fact that adding the same constant to all components of \boldsymbol{b} does not change the sorting (Top-k) result. We can therefore change the update rule [eq:aux-loss-free] to:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \left[\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q}) - \overline{\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q})}\right]\label{eq:aux-loss-free-2} \end{equation}
Here, a bar over a vector denotes the mean of its components, which is a scalar, and subtracting a scalar from a vector means subtracting it from every component. With this rule, the balancing term never changes \overline{\boldsymbol{b}}, yet the load balancing effect is unaffected. We can thus reserve the degree of freedom \overline{\boldsymbol{b}} for budget control.
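As a rough numpy sketch of one step of equation [eq:aux-loss-free-2], continuing the sketch above (the function name and the convention that `scores` holds the Router scores of a batch of Tokens are assumptions on my part):

def update_bias_balanced(b, scores, alpha):
    # scores: (num_tokens, n) Router scores rho for a batch of Tokens; b: (n,) Bias
    f = ((scores + b) > 0).astype(float)  # f_i for every Token
    F_tilde = f.mean(axis=0)              # \tilde{F} = E[f]
    F = F_tilde / F_tilde.sum()           # current Expert distribution
    Q = np.ones_like(F) / len(F)          # target uniform distribution Q
    g = np.sign(F - Q)
    return b - alpha * (g - g.mean())     # centered update: the mean of b is untouched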
How should we understand this? Clearly, if a positive constant is added to every b_i, the probability of satisfying \rho_i + b_i > 0 increases, and with it the total budget. So the approach is simple: first compute the current average budget, which is exactly |\tilde{\boldsymbol{F}}|; if it is greater than k, decrease \boldsymbol{b} slightly, otherwise increase it. Folding this into equation [eq:aux-loss-free-2] gives:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \left[\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q}) - \overline{\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q})} + \mathop{\mathrm{sign}}(|\tilde{\boldsymbol{F}}|- k)\right]\label{eq:aux-loss-free-3} \end{equation}
If we only want the budget not to exceed k, rather than to equal k exactly, we can simply make no change when |\tilde{\boldsymbol{F}}| < k:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \left[\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q}) - \overline{\mathop{\mathrm{sign}}(\boldsymbol{F} - \boldsymbol{Q})} + \mathop{\mathrm{sign}}(\max(|\tilde{\boldsymbol{F}}|- k,0))\right]\label{eq:aux-loss-free-4} \end{equation}
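Extending the same sketch to equation [eq:aux-loss-free-3] only adds the budget term; the [eq:aux-loss-free-4] variant is noted in a comment (again, the names are illustrative):

def update_bias_budgeted(b, scores, alpha, k):
    f = ((scores + b) > 0).astype(float)
    F_tilde = f.mean(axis=0)              # \tilde{F}
    budget = F_tilde.sum()                # |\tilde{F}|: average number of selected Experts
    F = F_tilde / budget
    Q = np.ones_like(F) / len(F)
    g = np.sign(F - Q)
    # for equation [eq:aux-loss-free-4], replace np.sign(budget - k) with np.sign(max(budget - k, 0))
    return b - alpha * (g - g.mean() + np.sign(budget - k))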
Attempting Simplification
Looking more closely at equation [eq:aux-loss-free-3], we see that it does two things: it pushes \boldsymbol{F}=\tilde{\boldsymbol{F}}/|\tilde{\boldsymbol{F}}| toward \boldsymbol{Q}, and it pushes |\tilde{\boldsymbol{F}}| toward k. These can seemingly be merged into a single goal: pushing \tilde{\boldsymbol{F}} toward \tilde{\boldsymbol{Q}}=k\boldsymbol{Q}=(k/n,k/n,\cdots,k/n). Equation [eq:aux-loss-free-3] can thus be simplified to:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha \mathop{\mathrm{sign}}(\tilde{\boldsymbol{F}} - \tilde{\boldsymbol{Q}})\label{eq:aux-loss-free-5} \end{equation}
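Under the same assumptions as the sketches above, the simplified rule becomes:

def update_bias_simplified(b, scores, alpha, k):
    F_tilde = ((scores + b) > 0).mean(axis=0)   # \tilde{F}
    Q_tilde = k / len(F_tilde)                  # every component of \tilde{Q} equals k/n
    return b - alpha * np.sign(F_tilde - Q_tilde)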
I experimented with both equation [eq:aux-loss-free-3] and equation [eq:aux-loss-free-5] and found their effects to be largely similar. However, the load-balancing and budget-control indicators fluctuate much more under equation [eq:aux-loss-free-5] in the early stages of training. Readers who prioritize stability can therefore prefer equation [eq:aux-loss-free-3] or [eq:aux-loss-free-4], while readers who prefer simplicity can consider equation [eq:aux-loss-free-5].
Considering that \mathop{\mathrm{sign}} keeps only the sign of \tilde{F}_i - \tilde{Q}_i and discards its magnitude, I also tried replacing \mathop{\mathrm{sign}} with an RMS Norm:
\begin{equation} \boldsymbol{b}\leftarrow \boldsymbol{b} - \alpha (\tilde{\boldsymbol{F}} - \tilde{\boldsymbol{Q}})/\Vert\tilde{\boldsymbol{F}} - \tilde{\boldsymbol{Q}}\Vert_{RMS} \end{equation}
where \Vert\cdot\Vert_{RMS} of a vector is the square root of the mean of the squares of its components. Clearly, the RMS of a \mathop{\mathrm{sign}} vector is 1, and a vector after RMS Norm also has RMS 1, so the two updates have the same magnitude and can share the same \alpha. Since RMS Norm preserves the relative sizes of \tilde{F}_i - \tilde{Q}_i, smaller errors yield smaller updates, so it is slightly less volatile than \mathop{\mathrm{sign}}, though not by much.
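The corresponding sketch, with the sign replaced by an RMS-normalized error (the small epsilon guarding against division by zero is my own addition, not part of the formula):

def update_bias_rms(b, scores, alpha, k):
    F_tilde = ((scores + b) > 0).mean(axis=0)
    err = F_tilde - k / len(F_tilde)      # \tilde{F} - \tilde{Q}
    rms = np.sqrt((err ** 2).mean())      # RMS of the error vector
    return b - alpha * err / (rms + 1e-12)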
Of course, using RMS Norm to replace \mathop{\mathrm{sign}} to increase stability is a general trick. Equations [eq:aux-loss-free], [eq:aux-loss-free-2], [eq:aux-loss-free-3], or [eq:aux-loss-free-4] can all undergo such a replacement. This depends on personal preference; in short, it is slightly more stable but not significantly so.
Initialization Method
After solving the update rule for \boldsymbol{b}, let’s consider the initialization of \boldsymbol{b}, which is an interesting but not critical issue.
According to conventional practice, \boldsymbol{b} would be initialized to all zeros; if \boldsymbol{\rho} uses a Sigmoid activation, then in the initial stage all n Experts will be selected, which clearly exceeds the budget of \leq k and will cause many Token Drops. If we are not too fussy about this, it is not a serious problem: the other model parameters usually have a learning-rate Warmup while \boldsymbol{b} does not, so the model will automatically correct it within the first few Warmup steps.
If we do mind this, we can control the initial budget by adjusting the initialization of \boldsymbol{b}. Assuming the input to the Router is a d-dimensional vector satisfying zero mean and unit variance (which approximately holds with RMSNorm), and the Router’s weight initialization variance is \sigma^2, then the Router’s Logits are approximately zero mean with variance \sigma^2 d. With this data, we can use a normal approximation simulation plus a bisection method to estimate an initial \boldsymbol{b}:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def b_init(n, k, d, sigma, eps=0.1):
    """Estimate a uniform initial Bias so that roughly k of the n Experts are selected."""
    b1, b2 = -1, 0  # search interval for the Bias under Sigmoid scores
    std = sigma * d**0.5  # standard deviation of the Router's Logits
    logits = np.random.randn(10000, n) * std  # normal approximation of the Logits
    scores = sigmoid(logits)
    while True:  # bisection on the average number of selected Experts
        b = (b1 + b2) * 0.5
        c = ((scores + b) > 0).sum(1).mean()  # average number of Experts with rho + b > 0
        if -eps < c - k < eps:
            return b
        elif c > k:  # too many Experts selected: make the Bias more negative
            b2 = b
        else:
            b1 = b

b_init(32, 4, 1024, 6e-3)

The code assumes Sigmoid activation, so the search interval is [-1, 0]; if you use a different activation function, adjust it accordingly. That said, the suggestion here is the same as in "MoE Tour: 3. A Different Approach to Allocation": the \boldsymbol{\rho} to which \boldsymbol{b} is added can uniformly use a Sigmoid activation, while the \boldsymbol{\rho} that multiplies the Expert outputs can use other activation functions.
Summary
This article proposes an MoE design that dynamically selects the number of Experts. The main idea is to slightly modify the Loss-Free MoE form and then adjust the update rule of the Bias term, utilizing its extra degree of freedom to simultaneously achieve load balancing and budget control.