How is DeepSeek V4's tid2eid Generated? · English (unofficial) translations of posts at kexue.fm

Anyone who has trained MoE models knows that if you replace all MLP layers with MoE, the first few MoE layers near the embedding often struggle to achieve load balancing. To address this, DeepSeek V3, as well as our Kimi K2, adopted the strategy of “first_k_dense” — as the name suggests, the first k layers use regular Dense GLU instead of MoE. In DeepSeek V4, this strategy was changed to “first_k_hash”.

Here “hash” refers to “Hash Routing”, proposed in Hash Layers For Large Sparse Models. It uses a pre-determined mapping table tid2eid to assign an Expert to each Token. This article explores how to generate this tid2eid.

Problem Background

First of all, it should be pointed out that for the first few MoE layers, if one uses the Quantile Balancing method proposed by the author in MoE Travelogue: 6. Optimal Allocation Promotes Balance and MoE Travelogue: 7. Minimalist Solution for Dynamic Activation, the imbalance problem can actually be alleviated to a large extent.

As for V4’s adoption of Hash Routing, it can be seen as a different attempt at the same problem. The idea is that for the first few MoE layers, there is not much contextual information, so selecting Experts based on the input vector may not be much different from selecting based on the Token Id, which has no contextual information at all. In that case, it might be better to simply pre-determine which Experts each Token should activate in the first few layers based on the Token Id. This is the origin of “tid2eid (Token Id to Expert Id)”.

So how do we generate this tid2eid table? Can it be completely random? It seems not, because the frequency of each Token is different; complete randomness would instead lead to imbalance. DeepSeek has not disclosed the details of generating this table, so we can only speculate based on our own reasoning.

Mathematical Description

Assume there are m distinct Tokens, and the frequency of the i-th Token is p_i. Without loss of generality, assume they are sorted in descending order, i.e., p_i \geq p_{i+1}. Let x_{i,j} \in \{0, 1\} indicate whether Token i should activate Expert j (0 means not activate, 1 means activate). There are n Experts, and each Token selects k Experts. Then we essentially need to solve the system of equations: x_{i,j}\in\{0, 1\},\qquad\sum_{j=1}^n x_{i,j} = k,\qquad \sum_{i=1}^m p_i x_{i,j}\approx \frac{k}{n} Note that in the last equation we use approximately equal \approx, because if we change it to =, the system may not have a solution strictly speaking. So we keep the \approx notation, meaning the two sides should be as close as possible. It can also be written as an optimization problem: \min_{x_{i,j}\in\{0, 1\}} \sum_{j=1}^n \left(\sum_{i=1}^m p_i x_{i,j} - \frac{k}{n}\right)^2 \qquad\text{s.t.}\qquad\sum_{j=1}^n x_{i,j} = k A relatively straightforward solution to this problem is a greedy algorithm:

Process Tokens one by one in descending order of frequency. First, tally the assigned load of each Expert, then select the k Experts with the lightest load as the activated Experts for the current Token, and then update the Expert load distribution.

Reference Implementation

Although the greedy algorithm is greedy in principle, it can obtain a near-optimal solution in most cases. A reference implementation is as follows:

import numpy as np

# Simulated distribution
m, n, k = 80000, 128, 4
p = 1 / (10 + np.arange(m))
p /= p.sum()

# Greedy processing
x, f = np.zeros((m, k), dtype='int32'), np.zeros(n)
for i in range(m):
    j = f.argsort()[:k]  # select the k with lightest load
    f[j] += p[i] / k     # update load distribution
    x[i] = j             # record into tid2eid

# Evaluate balance
max_vio, min_vio = f.max() * n - 1, f.min() * n - 1

Note that this code has no randomness; it is in principle a deterministic algorithm. But what if we want different tid2eid for the first few layers? We can consider introducing some randomness, for example, replacing range(m) with np.random.permutation(m), i.e., not processing in descending order of frequency. This can still yield usable solutions, though the balance will be slightly worse. We can run it multiple times and pick the solutions with the lowest max_vio.

Extreme Cases

Now consider an extreme case: p_1 \gg k/n, i.e., the frequency of a certain Token far exceeds the balanced level. In this case, no matter how tid2eid is arranged, load balancing cannot be achieved. In other words, making decisions based solely on a single Token Id can never achieve balance. How should we deal with this situation?

The answer is simple: switch to a Hash method that relies on more input information. Previously we only made decisions based on the current Token Id. In reality, each Token is in a sequence and has context. If the previous approach was “1-gram to expert”, we can now consider combining it with the preceding Token to form “2-gram to expert”, “3-gram to expert”, etc.

Taking 2-gram (a, b) as an example, a naive Hash method is: choose a prime number q greater than m, compute (a q + b) \bmod n as the activated Expert, and repeat k times with different prime numbers to get k Experts. By increasing the gram size, the number of input combinations greatly increases, and then mapping to a finite number n, each number is likely to be “filled up”. Therefore, as long as the Hash function is not particularly bad, the result will be almost balanced.

Thus, its advantage is that there is no need to consider frequency, nor to memorize a tid2eid table in advance. The disadvantage is that the computation of the Hash function is slightly more complex, and the same Token might select duplicate Experts, but these are not particularly serious problems.

Summary

This article briefly outlines the basic idea of Hash Routing in DeepSeek V4, and focuses on discussing the construction principle of its tid2eid mapping table.

For reprinting, please include the address of this article: https://kexue.fm/archives/11750

For more detailed reprinting matters, please refer to: Science Space FAQ