
MoE Tour: 1. From a Geometric Perspective

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

Two years ago, in a moment of inspiration, I started a series called "The Road to Transformer Upgrades", where I shared improvements and personal reflections on mainstream Transformer architectures, which was well-received by some readers. Starting from this article, we will follow a similar style to introduce another currently mainstream architecture: MoE (Mixture of Experts).

The popularity of MoE needs no introduction. The recently viral DeepSeek-V3 is an MoE architecture; GPT-4 is rumored to be one as well, and many recently released Chinese models have also adopted MoE. However, although research on MoE has a long history, its adoption remained lukewarm for a long time; it was only from last year's "Mixtral of Experts" onward that MoE gradually began to attract widespread attention. Its key advantage is a large parameter count combined with significantly lower training and inference costs.

At the same time, MoE also faces several challenges, such as training instability, load imbalance, and suboptimal performance, which were the primary reasons it did not become popular in its early years. However, with the increased attention of the past two years, these problems have largely been resolved. We will discuss them one by one in the posts to come.

Problem Definition

First, I should point out that I will use my own line of reasoning to introduce MoE. I will include relevant references where necessary, but I will not perform a systematic historical tracing of the MoE architecture; I hope readers will understand.

We know that Transformer models consist of Attention layers and MLP layers. MoE replaces the MLP layers in the model. MLP layers are further divided into FFN (FeedForward Network) and GLU (Gated Linear Unit) types. While GLU is more mainstream, for simplicity, we will use FFN as an example:

\begin{equation} \boldsymbol{y} = f(\boldsymbol{x}\boldsymbol{W}^{(A)})\boldsymbol{W}^{(B)} \end{equation}

where \boldsymbol{x}\in\mathbb{R}^{d} is the input vector (row vector), \boldsymbol{W}^{(A)}\in\mathbb{R}^{d\times D} and \boldsymbol{W}^{(B)}\in\mathbb{R}^{D\times d} are two parameter matrices, and f is an element-wise activation function. Suppose n is an integer that divides D; then the above can be equivalently written using block matrices as:

\begin{equation} \boldsymbol{y} = f\big(\boldsymbol{x}\begin{bmatrix}\boldsymbol{W}^{(A)}_1 & \boldsymbol{W}^{(A)}_2 & \cdots & \boldsymbol{W}^{(A)}_n\end{bmatrix}\big)\begin{bmatrix}\boldsymbol{W}^{(B)}_1 \\ \boldsymbol{W}^{(B)}_2 \\ \vdots \\ \boldsymbol{W}^{(B)}_n\end{bmatrix} = \sum_{i=1}^n \underbrace{f(\boldsymbol{x}\boldsymbol{W}^{(A)}_i)\boldsymbol{W}^{(B)}_i}_{\boldsymbol{v}_i} \end{equation}

where \boldsymbol{W}^{(A)}_i = \boldsymbol{W}^{(A)}_{[:,(i-1)c:ic]}, \boldsymbol{W}^{(B)}_i = \boldsymbol{W}^{(B)}_{[(i-1)c:ic,:]}, and c = D/n; the slicing follows Python rules. Thus, an FFN can be equivalently represented as the sum of n vectors \boldsymbol{v}_1, \boldsymbol{v}_2, \cdots, \boldsymbol{v}_n, where each vector is the output of a small model f(\boldsymbol{x}\boldsymbol{W}^{(A)}_i)\boldsymbol{W}^{(B)}_i. Each small model has the same computational cost, and these small models are the "Experts" in MoE.
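To make the block decomposition concrete, here is a minimal numpy sketch (my own illustration, not from the original post; the sizes and the ReLU activation are arbitrary assumptions) checking that the FFN output equals the sum of the n expert outputs:

```python
import numpy as np

d, D, n = 8, 32, 4                      # hidden size, FFN size, number of experts
c = D // n                              # width of each block
rng = np.random.default_rng(0)

x = rng.normal(size=(d,))               # input row vector
W_A = rng.normal(size=(d, D))
W_B = rng.normal(size=(D, d))
f = lambda t: np.maximum(t, 0.0)        # ReLU as the element-wise activation

# Full FFN: y = f(x W^(A)) W^(B)
y_full = f(x @ W_A) @ W_B

# Sum of the n expert outputs v_i = f(x W^(A)_i) W^(B)_i
v = [f(x @ W_A[:, i*c:(i+1)*c]) @ W_B[i*c:(i+1)*c, :] for i in range(n)]
y_sum = np.sum(v, axis=0)

print(np.allclose(y_full, y_sum))       # True: the two are identical
```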

The question MoE poses is:

Can we pick only k of the vectors and sum them to approximate the sum of all n vectors? That would reduce the computational cost to k/n of the original.

Norm Sorting

We actually explored this problem in "The Road to Low-Rank Approximation (III): CR". Written as a mathematical formula, it is:

\begin{equation} \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_n\in\{0,1\}}\left\Vert\sum_{i=1}^n \lambda_i \boldsymbol{v}_i - \sum_{i=1}^n\boldsymbol{v}_i\right\Vert^2\quad\text{s.t.}\quad \sum_{i=1}^n \lambda_i = k \end{equation}

Letting \gamma_i = 1 - \lambda_i, it can be rewritten as:

\begin{equation} \mathop{\text{argmin}}_{\gamma_1,\gamma_2,\cdots,\gamma_n\in\{0,1\}}\left\Vert\sum_{i=1}^n \gamma_i \boldsymbol{v}_i\right\Vert^2\quad\text{s.t.}\quad \sum_{i=1}^n \gamma_i = n - k \end{equation}

Solving this problem exactly is quite difficult, but there is a simple approximate solution: when the \boldsymbol{v}_i are pairwise orthogonal, we have

\begin{equation} \left\Vert\sum_{i=1}^n \gamma_i \boldsymbol{v}_i\right\Vert^2 = \sum_{i=1}^n \gamma_i^2 \Vert\boldsymbol{v}_i\Vert^2 = \sum_{i=1}^n \gamma_i \Vert\boldsymbol{v}_i\Vert^2 \end{equation}

The optimal solution is obviously to set to 1 the n-k values of \gamma_i corresponding to the smallest norms \Vert\boldsymbol{v}_i\Vert, which is equivalent to picking the k vectors with the largest norms to approximate the sum of all n vectors. Even when the \boldsymbol{v}_i do not satisfy the pairwise orthogonality condition, we still use this as an approximate solution. Its geometric meaning is intuitive: vectors with larger norms are less likely to be canceled out during the summation, and thus play a more prominent role.
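As a quick sanity check of this strategy (again my own illustration with arbitrary sizes, not code from the post), one can compare the error of keeping the k largest-norm vectors against keeping k random ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 16, 4, 64
v = rng.normal(size=(n, d))             # stand-ins for the expert vectors v_i

full = v.sum(axis=0)                    # the exact sum of all n vectors
norms = np.linalg.norm(v, axis=1)

top_k = np.argsort(-norms)[:k]          # k vectors with the largest norms
rand_k = rng.choice(n, size=k, replace=False)

err_top = np.linalg.norm(v[top_k].sum(axis=0) - full)
err_rand = np.linalg.norm(v[rand_k].sum(axis=0) - full)
print(err_top, err_rand)                # the norm-based choice usually gives the smaller error
```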

Furthermore, in "The Road to Low-Rank Approximation (III): CR", we also discussed an approximation scheme based on probabilistic sampling; under the minimum-variance criterion, the optimal sampling probabilities are likewise proportional to the norms. So, overall, sorting by vector norm is a simple yet effective strategy.
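For completeness, here is a sketch of the norm-proportional sampling estimator that line of reasoning suggests (my own paraphrase with arbitrary sizes, not code from the referenced post): sample k indices with probability p_i proportional to \Vert\boldsymbol{v}_i\Vert and use the importance-weighted average as an unbiased estimate of the full sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 16, 4, 64
v = rng.normal(size=(n, d))

p = np.linalg.norm(v, axis=1)
p = p / p.sum()                          # sampling probabilities proportional to the norms

idx = rng.choice(n, size=k, replace=True, p=p)
estimate = (v[idx] / p[idx, None]).mean(axis=0)   # (1/k) * sum_t v_{i_t} / p_{i_t}

print(np.linalg.norm(estimate - v.sum(axis=0)))   # error of the sampled approximation
```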

The Birth of MoE

Now we have a strategy—"pick the k vectors with the largest norms"—but upon closer reflection, we find it is not practical: to pick the k vectors with the largest norms, one must calculate the norms of all vectors, which means all \boldsymbol{v}_i must be calculated first. But our original goal was to reduce the computation of \boldsymbol{v}_i!

To resolve this contradiction, we need to redesign each Expert so that its norm can be computed at a low cost. What does this mean? First, we normalize \boldsymbol{v}_i to get \boldsymbol{e}_i = \boldsymbol{v}_i/\Vert\boldsymbol{v}_i\Vert, so that each \boldsymbol{e}_i has the same norm. Next, we define:

\begin{equation} \underbrace{[\rho_1,\rho_2,\cdots,\rho_n]}_{\boldsymbol{\rho}} = h(\boldsymbol{x}\boldsymbol{W}^{(R)})\quad\in\mathbb{R}_{\geq 0}^n \end{equation}

where \boldsymbol{W}^{(R)}\in\mathbb{R}^{d\times n} is a parameter matrix, and h(\cdot) is an activation function from \mathbb{R} to \mathbb{R}_{\geq 0}. Simply put, this is a linear transformation from d dimensions to n dimensions plus an activation function, so its computational cost is relatively small. This part of the model is called the "Router" in MoE.
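In code, the Router described above is just a d-to-n linear map followed by a non-negative activation; a minimal sketch (the names, sizes, and the choice of ReLU for h are my own assumptions):

```python
import numpy as np

def router(x, W_R, h=lambda t: np.maximum(t, 0.0)):
    """Return the non-negative scores rho = h(x W^(R)), one per Expert."""
    return h(x @ W_R)

rng = np.random.default_rng(0)
d, n = 8, 4
x = rng.normal(size=(d,))
W_R = rng.normal(size=(d, n))
print(router(x, W_R))                   # rho_1, ..., rho_n >= 0
```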

What is the role of \boldsymbol{\rho}? To predict the norm of each Expert! In other words, we treat \rho_i as the norm of the i-th Expert, so the complete Expert is \rho_i \boldsymbol{e}_i, decomposed into two parts: the norm \rho_i, which is cheap to compute, and the direction \boldsymbol{e}_i, which is expensive. To reduce computation, we first calculate \boldsymbol{\rho}, pick the Top-k largest values, and only then calculate the corresponding \boldsymbol{e}_i, multiply by \rho_i, and sum them up:

\begin{equation} \boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho}} \rho_i \boldsymbol{e}_i \end{equation}

This is the basic formula of the MoE model. Since only the Top-k part is retained in the calculation, it is essentially a Sparse model, while the original FFN, or the model with k=n, is usually called the corresponding Dense model.
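Putting the pieces together, here is a self-contained sketch of this MoE forward pass (my own illustration; the shapes, the ReLU activations, and the use of plain L2 normalization are assumptions): compute \boldsymbol{\rho} first, keep the Top-k, and only then evaluate the corresponding Experts.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, n, k = 8, 32, 8, 2
c = D // n
f = lambda t: np.maximum(t, 0.0)        # expert activation
h = lambda t: np.maximum(t, 0.0)        # router activation (non-negative)

x = rng.normal(size=(d,))
W_A = rng.normal(size=(d, D))
W_B = rng.normal(size=(D, d))
W_R = rng.normal(size=(d, n))

rho = h(x @ W_R)                        # cheap: predicted norms of all n Experts
top = np.argsort(-rho)[:k]              # Top-k scores decide which Experts to run

y = np.zeros(d)
for i in top:
    v_i = f(x @ W_A[:, i*c:(i+1)*c]) @ W_B[i*c:(i+1)*c, :]
    e_i = v_i / (np.linalg.norm(v_i) + 1e-9)   # normalized direction e_i
    y += rho[i] * e_i                   # rho_i supplies the norm
print(y)
```

Only k of the n Experts are ever evaluated, which is exactly where the k/n cost saving comes from.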

Summary of Logic

Whether or not readers are familiar with MoE, they may find the above derivation a bit unfamiliar, because it is a way of understanding MoE that I developed myself. However, since its geometric meaning is clearer, it should also be easier to grasp.

Let’s organize the entire line of thought again:

  1. A regular Dense FFN model can be equivalently rewritten as the sum of n Expert vectors \boldsymbol{v}_1, \boldsymbol{v}_2, \cdots, \boldsymbol{v}_n;

  2. To save computation, we attempt to pick k vectors to sum up to approximate the original sum of n vectors;

  3. After transforming this into a mathematical optimization problem, we find that the selection rule is to pick the k vectors with the largest norms;

  4. Directly calculating the norms of n Experts and then selecting k does not actually save computation, so the Expert must be redesigned;

  5. Normalize \boldsymbol{v}_i to get \boldsymbol{e}_i, and then use another small model (Router) to predict the norm \rho_i. The final Expert is \rho_i \boldsymbol{e}_i;

  6. At this point, we can calculate all \rho_i first, pick the top k, and only then calculate \boldsymbol{e}_i, achieving the goal of saving computation.

Why Do It This Way

Some readers might wonder why we go through this seemingly convoluted process. Isn't the original MoE easy enough to understand? The general form of MoE is:

\begin{equation} \boldsymbol{y} = \sum_{i\in \mathop{\text{argtop}}_k \boldsymbol{\rho}} \rho_i \boldsymbol{v}_i \end{equation}

That is, compared with our formula, the \boldsymbol{v}_i are not normalized before the summation. In this case \rho_i no longer has the meaning of a norm; it is purely a scoring model (i.e., the Router) used to rank the Experts. But why does multiplying \rho_i onto the Expert output allow the Router to learn to rank Experts correctly? I have only found "Sparse Backpropagation for MoE Training" offering an explanation for this, and it is still not entirely intuitive.
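For contrast, a minimal sketch of this conventional combination (my own illustration with assumed shapes and a Softmax router, which is the common choice): the selected \boldsymbol{v}_i are weighted by \rho_i without being normalized first.

```python
import numpy as np

def moe_conventional(x, W_A, W_B, W_R, k, f=lambda t: np.maximum(t, 0.0)):
    n = W_R.shape[1]
    c = W_A.shape[1] // n
    logits = x @ W_R
    z = np.exp(logits - np.max(logits))
    rho = z / z.sum()                   # Softmax router scores
    top = np.argsort(-rho)[:k]          # pick the Top-k Experts
    # weight the raw expert outputs v_i by rho_i (no normalization of v_i)
    return sum(rho[i] * (f(x @ W_A[:, i*c:(i+1)*c]) @ W_B[i*c:(i+1)*c, :]) for i in top)

rng = np.random.default_rng(0)
d, D, n, k = 8, 32, 8, 2
y = moe_conventional(rng.normal(size=(d,)), rng.normal(size=(d, D)),
                     rng.normal(size=(D, d)), rng.normal(size=(d, n)), k)
print(y)
```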

From the geometric perspective of this article, many of these issues "suddenly become clear." After reparameterizing the Expert as \rho_i \boldsymbol{e}_i, the Dense model corresponds to the sum of all \rho_i \boldsymbol{e}_i, while MoE corresponds to the sum over the Top-k of the \rho_i, which is an approximation of the Dense model with theoretical guarantees. We do not need to worry about how the Router chooses Experts; we simply try to approximate the Dense model at every step. This is arguably the best choice when we want both a large parameter count and a small computational cost.

Since the geometric meaning of \rho_i is now a norm rather than a probability, there is no requirement that the activation function h(\cdot) be normalized. Besides Softmax, one could consider Sigmoid or ReLU, or the smooth Top-k approximation introduced in "The Sequel to Softmax: Finding Smooth Approximations for Top-K". Using a non-normalized activation for the Router helps avoid unhealthy competition between Experts when k > 1, which sometimes yields better results.

Finally, a supplementary note: we previously defined \boldsymbol{e}_i = \boldsymbol{v}_i/\Vert\boldsymbol{v}_i\Vert to ensure that all \boldsymbol{e}_i have the same norm. In practice, it does not have to be L2 normalization; any equivalent operation will do, such as RMS Norm with the gamma parameter fixed to 1, which better matches our usual output conventions.
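For instance, a minimal sketch of that alternative (my own illustration, not code from the post): RMS Norm with gamma fixed to 1 rescales every \boldsymbol{v}_i to the same norm \sqrt{d} instead of 1, which serves the argument above equally well.

```python
import numpy as np

def rms_norm(v, eps=1e-6):
    """RMS Norm with the gamma parameter fixed to 1."""
    return v / np.sqrt(np.mean(v**2) + eps)

v = np.array([3.0, -4.0, 0.0, 0.0])
u = rms_norm(v)
print(u, np.linalg.norm(u))             # norm is ~sqrt(len(v)) = 2.0, the same for any v
```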

Conclusion

This article derives and interprets MoE from the perspective of the best approximation of a Dense model, arriving at a specific form of MoE. Compared with existing MoE, it adds a Normalize step, which makes the geometric meaning of MoE much more apparent. Of course, whether normalized or not, the journey of MoE has only just begun, and more challenges lie ahead.

When reprinting, please include the original address of this article: https://kexue.fm/archives/10699

For more detailed reprinting matters, please refer to: "Scientific Space FAQ"