Anyone who has trained MoE models
knows that if you replace all MLP layers with MoE, the first few MoE
layers near the embedding often struggle to achieve load balancing. To
address this, DeepSeek
V3, as well as our Kimi K2, adopted the
strategy of “first_k_dense” — as the name suggests, the
first k layers use regular Dense GLU
instead of MoE. In DeepSeek
V4, this strategy was changed to “first_k_hash”.
Here “hash” refers to “Hash Routing”, proposed in Hash Layers For Large Sparse
Models. It uses a pre-determined mapping table tid2eid
to assign an Expert to each Token. This article explores how to generate
this tid2eid.
Problem Background
First of all, it should be pointed out that for the first few MoE layers, if one uses the Quantile Balancing method proposed by the author in MoE Travelogue: 6. Optimal Allocation Promotes Balance and MoE Travelogue: 7. Minimalist Solution for Dynamic Activation, the imbalance problem can actually be alleviated to a large extent.
As for V4’s adoption of Hash Routing, it can be seen as a different
attempt at the same problem. The idea is that for the first few MoE
layers, there is not much contextual information, so selecting Experts
based on the input vector may not be much different from selecting based
on the Token Id, which has no contextual information at all. In that
case, it might be better to simply pre-determine which Experts each
Token should activate in the first few layers based on the Token Id.
This is the origin of “tid2eid (Token Id to Expert
Id)”.
So how do we generate this tid2eid table? Can it be
completely random? It seems not, because the frequency of each Token is
different; complete randomness would instead lead to imbalance. DeepSeek
has not disclosed the details of generating this table, so we can only
speculate based on our own reasoning.
Mathematical Description
Assume there are m distinct Tokens, and the frequency of the i-th Token is p_i. Without loss of generality, assume they are sorted in descending order, i.e., p_i \geq p_{i+1}. Let x_{i,j} \in \{0, 1\} indicate whether Token i should activate Expert j (0 means not activate, 1 means activate). There are n Experts, and each Token selects k Experts. Then we essentially need to solve the system of equations: x_{i,j}\in\{0, 1\},\qquad\sum_{j=1}^n x_{i,j} = k,\qquad \sum_{i=1}^m p_i x_{i,j}\approx \frac{k}{n} Note that in the last equation we use approximately equal \approx, because if we change it to =, the system may not have a solution strictly speaking. So we keep the \approx notation, meaning the two sides should be as close as possible. It can also be written as an optimization problem: \min_{x_{i,j}\in\{0, 1\}} \sum_{j=1}^n \left(\sum_{i=1}^m p_i x_{i,j} - \frac{k}{n}\right)^2 \qquad\text{s.t.}\qquad\sum_{j=1}^n x_{i,j} = k A relatively straightforward solution to this problem is a greedy algorithm:
Process Tokens one by one in descending order of frequency. First, tally the assigned load of each Expert, then select the k Experts with the lightest load as the activated Experts for the current Token, and then update the Expert load distribution.
Reference Implementation
Although the greedy algorithm is greedy in principle, it can obtain a near-optimal solution in most cases. A reference implementation is as follows:
import numpy as np
# Simulated distribution
m, n, k = 80000, 128, 4
p = 1 / (10 + np.arange(m))
p /= p.sum()
# Greedy processing
x, f = np.zeros((m, k), dtype='int32'), np.zeros(n)
for i in range(m):
j = f.argsort()[:k] # select the k with lightest load
f[j] += p[i] / k # update load distribution
x[i] = j # record into tid2eid
# Evaluate balance
max_vio, min_vio = f.max() * n - 1, f.min() * n - 1Note that this code has no randomness; it is in principle a
deterministic algorithm. But what if we want different
tid2eid for the first few layers? We can consider
introducing some randomness, for example, replacing
range(m) with np.random.permutation(m), i.e.,
not processing in descending order of frequency. This can still yield
usable solutions, though the balance will be slightly worse. We can run
it multiple times and pick the solutions with the lowest
max_vio.
Extreme Cases
Now consider an extreme case: p_1 \gg
k/n, i.e., the frequency of a certain Token far exceeds the
balanced level. In this case, no matter how tid2eid is
arranged, load balancing cannot be achieved. In other words, making
decisions based solely on a single Token Id can never achieve balance.
How should we deal with this situation?
The answer is simple: switch to a Hash method that relies on more input information. Previously we only made decisions based on the current Token Id. In reality, each Token is in a sequence and has context. If the previous approach was “1-gram to expert”, we can now consider combining it with the preceding Token to form “2-gram to expert”, “3-gram to expert”, etc.
Taking 2-gram (a, b) as an example, a naive Hash method is: choose a prime number q greater than m, compute (a q + b) \bmod n as the activated Expert, and repeat k times with different prime numbers to get k Experts. By increasing the gram size, the number of input combinations greatly increases, and then mapping to a finite number n, each number is likely to be “filled up”. Therefore, as long as the Hash function is not particularly bad, the result will be almost balanced.
Thus, its advantage is that there is no need to consider frequency,
nor to memorize a tid2eid table in advance. The
disadvantage is that the computation of the Hash function is slightly
more complex, and the same Token might select duplicate Experts, but
these are not particularly serious problems.
Summary
This article briefly outlines the basic idea of Hash Routing in
DeepSeek V4, and focuses on discussing the construction principle of its
tid2eid mapping table.
For reprinting, please include the address of this article: https://kexue.fm/archives/11750
For more detailed reprinting matters, please refer to: Science Space FAQ