Muon Implementation via Streaming Power Iteration: 5. Extensions · English (unofficial) translations of posts at kexue.fm

The theme of this series of articles is “streaming power iteration.” As the name suggests, it consists of two parts: “streaming” and “power iteration.” Power iteration is a classic multi-step iterative scheme for computing the SVD of a matrix, while “streaming” refers to amortizing the originally multi-step iterative algorithm over each training step, making the computational cost acceptable. The core idea is: rather than completing a complex computation all at once, it is better to continuously approach the target during the training process.

As an extension of this series, this article introduces several other applications of the “streaming” idea, further demonstrating how to ingeniously incorporate relatively expensive operations into the training process through streaming transformation.

Orthogonal Projection

In some scenarios, we may wish to constrain certain parameter matrices to be orthogonal. Orthogonal matrices possess good numerical stability, can avoid issues such as numerical explosion or vanishing, and in certain designs can provide better theoretical guarantees. Of course, which specific places are suitable for constraining parameters to be orthogonal matrices requires case-by-case analysis; we will not elaborate on that here.

In the articles “Steepest Descent on Manifolds: 2. Muon + Orthogonal” and “Steepest Descent on Manifolds: 3. Muon + Stiefel”, we explored orthogonal (Stiefel) manifolds, but those aimed to combine steepest descent to derive new update rules, which was overall quite complex. Here we only consider a simpler approach: after each update step, reproject (retract) the parameters back onto the orthogonal manifold.

Without loss of generality, let the parameters be \boldsymbol{W}\in\mathbb{R}^{n\times m}, with n \geq m. Then the operation we want to perform can be written as \boldsymbol{W}_t = \boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t \qquad\to\qquad \boldsymbol{W}_t = \mathop{\text{orth}}(\boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t) where \mathop{\text{orth}} is defined as \mathop{\text{orth}}(\boldsymbol{W}) = \mathop{\text{argmin}}_{\boldsymbol{O}^{\top}\boldsymbol{O}=\boldsymbol{I}} \Vert\boldsymbol{W} - \boldsymbol{O}\Vert_F i.e., finding the orthogonal matrix nearest to \boldsymbol{W}. This operation already appeared in the first blog post introducing Muon, “Appreciating the Muon Optimizer: The Essential Leap from Vectors to Matrices”. It is precisely the \mathop{\text{msign}} in Muon, so the operation we need to implement can be written as \boldsymbol{W}_t = \mathop{\text{msign}}(\boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t) \label{eq:msign-u}

Streaming Iteration

That is to say, after each update step we need to perform \mathop{\text{msign}} once more to project the parameters back onto the orthogonal manifold. However, the \mathop{\text{msign}} operation is not extremely expensive, but it is not cheap either. We know that the core computation of Muon is \mathop{\text{msign}}, but Muon’s \mathop{\text{msign}} runs in BF16, whereas the parameters are stored in FP32; therefore the \mathop{\text{msign}} on parameters needs to run in FP32, incurring a considerable cost.

The efficient computation of \mathop{\text{msign}} is based on the Newton–Schulz iteration, which we have discussed in detail in “Newton–Schulz Iteration for the msign Operator (Part 1)” and “Newton–Schulz Iteration for the msign Operator (Part 2)”. Different Newton–Schulz iterations have different polynomial degrees and coefficients, corresponding to different convergence rates. A classic 3rd-order scheme is \boldsymbol{X}_t = \frac{3}{2}\boldsymbol{X}_{t-1} - \frac{1}{2}\boldsymbol{X}_{t-1}\boldsymbol{X}_{t-1}^{\top}\boldsymbol{X}_{t-1},\qquad \boldsymbol{X}_0 = \boldsymbol{W} It can be shown that \lim_{t\to\infty} \boldsymbol{X}_t = \mathop{\text{msign}}(\boldsymbol{W}). Although this iteration is classic, it converges relatively slowly; in Muon we usually use a faster-converging 5th-order iteration. But regardless of which one, the cost of performing the full iteration is considerable.

However, for the parameters, do we really need to perform the full iteration at every step? Assuming \boldsymbol{W}_{t-1} is already orthogonal or nearly orthogonal, because the learning rate \eta is small, \boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t will not deviate significantly from orthogonality. This is precisely where the streaming idea comes into play—perhaps performing just one iteration per training step suffices. For example, consider \boldsymbol{W}_t = \frac{3}{2}\tilde{\boldsymbol{W}}_t - \frac{1}{2}\tilde{\boldsymbol{W}}_t\tilde{\boldsymbol{W}}_t^{\top}\tilde{\boldsymbol{W}}_t,\qquad \tilde{\boldsymbol{W}}_t = \boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t From an asymptotic perspective, this also achieves the effect of [eq:msign-u], and each step only requires one extra term \tilde{\boldsymbol{W}}_t\tilde{\boldsymbol{W}}_t^{\top}\tilde{\boldsymbol{W}}_t, which is significantly cheaper than the full Newton–Schulz iteration.

Spectral Constraint

In general, orthogonal constraints can only be imposed on certain special matrices, because they are too strict, reducing the degrees of freedom of the matrix by half—or equivalently, halving the number of parameters. In many cases we may prefer to impose a looser spectral constraint, such as the singular value clipping mentioned in “Higher-order MuP: Simpler but Smarter Spectral Condition Scaling”.

If \mathop{\text{msign}} turns all singular values of a matrix into 1, then singular value clipping only turns those greater than 1 into 1, leaving the rest unchanged. In “Computing Singular Value Clipping mclip via msign (Part 1)” and “Computing Singular Value Clipping mclip via msign (Part 2)”, we call it \mathop{\text{mclip}}. Suppose what we want to do is clip all singular values to at most 1 after each parameter update, then we can write \boldsymbol{W}_t = \mathop{\text{mclip}}(\boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t) \label{eq:mclip-u} However, computing \mathop{\text{mclip}} is considerably more troublesome than \mathop{\text{msign}}. Previously we explored schemes to implement \mathop{\text{mclip}} based on \mathop{\text{msign}}, such as the identity \mathop{\text{mclip}}(\boldsymbol{M}) = \frac{1}{2}\Big[\boldsymbol{M} + \mathop{\text{msign}}(\boldsymbol{M}) + (\mathop{\text{msign}}(\boldsymbol{M}) - \boldsymbol{M}) \mathop{\text{msign}}(\boldsymbol{M}^{\top}\boldsymbol{M} - \boldsymbol{I})\Big] It requires two \mathop{\text{msign}} evaluations to compute \mathop{\text{mclip}}, which is quite expensive. Of course, we could also perform a streaming power iteration on \boldsymbol{W} and then compute \mathop{\text{mclip}} via SVD, but that would require an additional cached variable, which is also rather cumbersome.

One-by-one Clipping

Let us understand \mathop{\text{mclip}} from another perspective: \mathop{\text{mclip}} turns all singular values greater than 1 into 1. Therefore a necessary operation is to turn the principal singular value into 1 (if it is greater than 1). After we clip the principal singular value, if there still exist singular values greater than 1, the largest among them becomes the new principal singular value. This means we only need to repeatedly “clip the principal singular value to 1” to realize \mathop{\text{mclip}}.

The advantage of this “one-by-one clipping” strategy is that computing only the principal singular value and principal singular vectors (let us call it \mathop{\mathrm{SVD1}}) is far cheaper than performing a full SVD; it only requires a vector-style power iteration \boldsymbol{v}\quad\leftarrow\quad \frac{\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{v}}{\Vert\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{v}\Vert} which converges to the right principal singular vector \boldsymbol{v}_1 of \boldsymbol{W}, and then \sigma_1 = \Vert\boldsymbol{W}\boldsymbol{v}_1\Vert and \boldsymbol{u}_1 = \boldsymbol{W}\boldsymbol{v}_1/\sigma_1. In practice we fix the number of iterations to obtain an approximation. Then, still adhering to the “streaming” idea, at each update step we only perform one principal singular value clipping, i.e., \boldsymbol{W}_t = \tilde{\boldsymbol{W}}_t - \max(\sigma_1 - 1, 0) \boldsymbol{u}_1 \boldsymbol{v}_1^{\top},\quad\sigma_1, \boldsymbol{u}_1, \boldsymbol{v}_1 = \mathop{\mathrm{SVD1}}(\tilde{\boldsymbol{W}}_t),\quad\tilde{\boldsymbol{W}}_t = \boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t Under long-term training, this can control the singular values of \boldsymbol{W} to be near 1 (due to the approximation of power iteration and the streaming nature, they will actually be slightly larger than 1). This “streaming singular value clipping” can be regarded as a variant of the “Spectral Weight Decay” introduced in “From Spectral Norm Gradient to a Novel Weight Decay”, and in the paper “Training Transformers with Enforced Lipschitz Constants” it is called “Spectral Hammer.”

Other Examples

When is the streaming idea applicable? A typical scenario is: we can anticipate that under long-term training a certain variable changes slowly, so we can attempt to perform only a small number of iterations per training step, hoping that it is sufficient to correct the changes brought by parameter updates. In addition to the two new examples above, in some previous articles we have already employed the “streaming” idea; let us briefly review them here.

In “Steepest Descent on Manifolds: 3. Muon + Stiefel” and “Steepest Descent on Manifolds: 4. Muon + Spectral Sphere”, when solving steepest descent on the corresponding manifolds we needed to solve a nonlinear equation; at that time we proposed fixed-point iteration. Later in “Steepest Descent on Manifolds: 5. Dual Gradient Descent” we discussed the dual gradient descent method and proposed using the streaming idea to amortize the solution process over each training step.

Coincidentally, in “MoE Journey: 6. Optimal Allocation Promotes Balance”, we viewed the load balancing problem of MoE from the perspective of optimal allocation. When solving the dual problem of optimal allocation, originally we needed to alternately optimize \boldsymbol{\alpha} and \boldsymbol{\beta} until convergence at each step. But considering that the change of \boldsymbol{\beta} at each step should be small, still based on the streaming idea we only performed one update of \boldsymbol{\alpha} and \boldsymbol{\beta} per training step, achieving a decent load balancing effect.

In summary, when we realize that a variable changes slowly, we can consider performing only a small number of iterations per step to amortize the computational cost. To achieve this, sometimes we need to introduce additional cached variables (such as \boldsymbol{V} in streaming power iteration), but sometimes it is unnecessary. How to better design streaming iterations may require some ingenuity, and it is worth exploring and savoring in depth.

Summary

This article mainly introduced two additional examples of applying the “streaming” idea: projecting parameters onto the orthogonal (Stiefel) manifold with extremely small extra cost, and clipping the spectral norm of parameters to at most 1. These examples once again demonstrate the magic of the streaming idea: many seemingly complex computations that appear to require one-shot completion can be decomposed into gradual fine-tuning operations and integrated into each training step.

When reprinting, please include the address of this article: https://kexue.fm/archives/11719

For more detailed reprint matters, please refer to: Science Space FAQ