Steepest Descent on Manifolds: 6. Muon + Double Rotation · English (unofficial) translations of posts at kexue.fm

We know that when updating matrix parameters with optimizers like Adam and Muon, the singular values and left/right singular vectors all change accordingly, and they are usually coupled together. It is precisely because of this coupling that we cannot simply control the singular values of matrix parameters. Therefore, when singular values grow abnormally, we cannot easily and effectively prevent it, which may lead to training failure.

Inspired by “Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation” (hereinafter referred to as Pion), this article proposes “Rotational Muon (MuonR)”, a Muon variant that updates the left and right singular vectors of the matrix separately. It keeps the singular value distribution of the matrix unchanged, thereby ensuring training stability.

Review of Previous Work

Since the matrices composed of the left and right singular vectors must be orthogonal, let us first briefly review Muon under orthogonal constraints. Let the parameter \boldsymbol{W}\in\mathbb{R}^{n\times n} satisfy \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I}, and let the update be \Delta\boldsymbol{W}=-\eta \boldsymbol{\Phi}. We want orthogonality to be maintained after the parameter update. The corresponding steepest descent problem in spectral norm is then

\max_{\boldsymbol{\Phi}} \mathop{\text{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\quad(\boldsymbol{W} - \eta \boldsymbol{\Phi})^{\top}(\boldsymbol{W} - \eta \boldsymbol{\Phi})=\boldsymbol{I}

which can be solved to give \boldsymbol{\Phi} = \boldsymbol{W}\boldsymbol{O}, where \boldsymbol{O}=\mathop{\text{msign}}([\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}), and [\boldsymbol{X}]_{\text{skew}} = (\boldsymbol{X} - \boldsymbol{X}^{\top})/2 is the skew-symmetrization operator. Taking into account the retraction operation, the complete update rule is:

\boldsymbol{W} \quad \leftarrow\quad \boldsymbol{W}(\boldsymbol{I} - \eta\boldsymbol{O})\left(\boldsymbol{I} - \boldsymbol{O}^{\top}\boldsymbol{O} + \frac{\boldsymbol{O}^{\top}\boldsymbol{O}}{\sqrt{1+\eta^2}}\right)\label{eq:orth-steepest}

In particular, if [\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}} is full rank, then \boldsymbol{O}^{\top}\boldsymbol{O}=\boldsymbol{I} and the rule simplifies to

\boldsymbol{W} \quad \leftarrow\quad \frac{\boldsymbol{W}(\boldsymbol{I} - \eta\boldsymbol{O})}{\sqrt{1+\eta^2}}\label{eq:orth-steepest-full}

The derivation can be found in “Steepest Descent on Manifolds: 2. Muon + Orthogonality”, and we will not expand on it here. Both Equation [eq:orth-steepest] and Equation [eq:orth-steepest-full] are completely analytical, adding only a few matrix multiplications on top of Muon with no significant increase in complexity, so this result is entirely practical.
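As a numerical sanity check of Equation [eq:orth-steepest], the following is a minimal NumPy sketch, not a reference implementation. It assumes \mathop{\text{msign}} is computed from an SVD, with singular values below a relative tolerance treated as zero. The helper names (`msign`, `skew`, `orth_muon_step`) and the tolerance are illustrative choices. An odd size n=5 is used so that the skew-symmetric matrix is necessarily rank deficient and the general form of the retraction is exercised.

```python
import numpy as np

def msign(x, tol=1e-8):
    """Matrix sign via SVD: U_k V_k^T over singular values above a relative tolerance."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    keep = s > tol * s.max()          # drop (numerically) zero singular values
    return u[:, keep] @ vt[keep, :]

def skew(x):
    """Skew-symmetrization [X]_skew = (X - X^T) / 2."""
    return (x - x.T) / 2

def orth_muon_step(w, g, eta):
    """One step W <- W (I - eta O)(I - O^T O + O^T O / sqrt(1 + eta^2))."""
    n = w.shape[0]
    o = msign(skew(w.T @ g))
    p = o.T @ o                       # projector onto the row space of O
    retract = np.eye(n) - p + p / np.sqrt(1 + eta**2)
    return w @ (np.eye(n) - eta * o) @ retract

rng = np.random.default_rng(0)
n = 5                                 # odd size: the skew matrix cannot be full rank
w, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal start
g = rng.standard_normal((n, n))                    # stand-in for a gradient

w_new = orth_muon_step(w, g, eta=0.1)
orth_error = np.abs(w_new.T @ w_new - np.eye(n)).max()
print("max |W^T W - I| after one step:", orth_error)
```

The printed deviation should be at the level of floating-point rounding, which reflects that the retraction restores orthogonality in closed form rather than approximately.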

Now consider a matrix \boldsymbol{W}\in\mathbb{R}^{n\times m}(n \geq m). If it also satisfies \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I}, we say that \boldsymbol{W} lies on the Stiefel manifold, which is a generalization of the concept of orthogonal matrices. The above results can theoretically be extended to the Stiefel manifold, but for non-square matrices, a system of nonlinear equations needs to be solved, making it difficult to apply in practice. Details can be found in “Steepest Descent on Manifolds: 3. Muon + Stiefel”.

Instantaneous Reparameterization

Next, we turn our attention to an arbitrary parameter matrix \boldsymbol{W}\in\mathbb{R}^{n\times m}, with the goal of keeping the singular values unchanged during parameter updates, thereby eliminating the possibility of abnormal growth of singular values.

To achieve this goal, we adopt the idea of “instantaneous reparameterization”: before the update begins, we reparameterize the matrix \boldsymbol{W} as \tilde{\boldsymbol{W}} = \boldsymbol{L}\boldsymbol{W}\boldsymbol{R}, where \boldsymbol{L}\in\mathbb{R}^{n\times n},\boldsymbol{R}\in\mathbb{R}^{m\times m}, and both are initialized as identity matrices. In this way, at initialization we have \tilde{\boldsymbol{W}}=\boldsymbol{W}, and by denoting \boldsymbol{G} = \nabla_{\boldsymbol{W}}\mathcal{L}, we can write

\nabla_{\boldsymbol{L}}\mathcal{L} = \boldsymbol{G}\boldsymbol{W}^{\top},\qquad \nabla_{\boldsymbol{R}}\mathcal{L} = \boldsymbol{W}^{\top}\boldsymbol{G}

Then we freeze \boldsymbol{W} and update only \boldsymbol{L} and \boldsymbol{R}, keeping \boldsymbol{L},\boldsymbol{R} orthogonal throughout the update. In this way, the updated \tilde{\boldsymbol{W}} still has the same singular values as \boldsymbol{W}. From the perspective of \boldsymbol{L} and \boldsymbol{R}, the problem once again becomes steepest descent on the orthogonal manifold, and this time \boldsymbol{L},\boldsymbol{R} are square matrices, so the corresponding steepest descent has a completely analytical solution! Applying Equation [eq:orth-steepest] to \boldsymbol{L} and \boldsymbol{R} in place of \boldsymbol{W} (both currently equal to \boldsymbol{I}), we can directly write the update rules

\begin{gathered} \boldsymbol{L}\quad\leftarrow\quad (\boldsymbol{I} - \eta\boldsymbol{O}_L)\left(\boldsymbol{I} - \boldsymbol{O}_L^{\top}\boldsymbol{O}_L + \frac{\boldsymbol{O}_L^{\top}\boldsymbol{O}_L}{\sqrt{1+\eta^2}}\right)\\ \boldsymbol{R}\quad\leftarrow\quad (\boldsymbol{I} - \eta\boldsymbol{O}_R)\left(\boldsymbol{I} - \boldsymbol{O}_R^{\top}\boldsymbol{O}_R + \frac{\boldsymbol{O}_R^{\top}\boldsymbol{O}_R}{\sqrt{1+\eta^2}}\right) \end{gathered}

where \boldsymbol{O}_L = \mathop{\text{msign}}([\boldsymbol{G}\boldsymbol{W}^{\top}]_{\text{skew}}),\boldsymbol{O}_R = \mathop{\text{msign}}([\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}). Multiplying the new \boldsymbol{L},\boldsymbol{R} with \boldsymbol{W} gives the complete update rule

\boldsymbol{W} \quad \leftarrow\quad \boldsymbol{L}\boldsymbol{W}\boldsymbol{R}

This is the “Rotational Muon (Muon under Rotation, MuonR)” derived from the idea of “instantaneous reparameterization”. In practical scenarios, momentum is usually present, which we understand as a smoothed gradient, so we only need to replace the gradient \boldsymbol{G} with the momentum \boldsymbol{M}.
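The MuonR update described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the author's implementation: \mathop{\text{msign}} is computed via SVD with a relative tolerance, random matrices stand in for the gradient or momentum, and the function names (`rotation`, `muonr_step`) are hypothetical. The check at the end compares the singular values of \boldsymbol{W} before and after a number of steps.

```python
import numpy as np

def msign(x, tol=1e-8):
    """Matrix sign via SVD, ignoring (numerically) zero singular values."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    keep = s > tol * s.max()
    return u[:, keep] @ vt[keep, :]

def skew(x):
    """Skew-symmetrization [X]_skew = (X - X^T) / 2."""
    return (x - x.T) / 2

def rotation(o, eta):
    """Orthogonal factor (I - eta O)(I - O^T O + O^T O / sqrt(1 + eta^2))."""
    k = o.shape[0]
    p = o.T @ o
    return (np.eye(k) - eta * o) @ (np.eye(k) - p + p / np.sqrt(1 + eta**2))

def muonr_step(w, m, eta):
    """One MuonR step W <- L W R, with m the gradient or momentum."""
    o_l = msign(skew(m @ w.T))   # n x n, from the gradient with respect to L
    o_r = msign(skew(w.T @ m))   # m x m, from the gradient with respect to R
    return rotation(o_l, eta) @ w @ rotation(o_r, eta)

rng = np.random.default_rng(0)
w = rng.standard_normal((6, 4))
sv_before = np.linalg.svd(w, compute_uv=False)

for _ in range(20):
    g = rng.standard_normal(w.shape)   # random stand-in for gradient/momentum
    w = muonr_step(w, g, eta=0.05)

sv_after = np.linalg.svd(w, compute_uv=False)
sv_drift = np.abs(sv_after - sv_before).max()
print("largest change in any singular value:", sv_drift)
```

Because \boldsymbol{L} and \boldsymbol{R} are orthogonal up to rounding, the drift should stay at floating-point level even though \boldsymbol{W} itself moves at every step.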

Some Details

Since the updates of \boldsymbol{L} and \boldsymbol{R} each require computing \mathop{\text{msign}} once, even in the most ideal case (n=m), the computational cost of MuonR is twice that of Muon. However, for sufficiently large models, this doubling of computation has a relatively weak impact on the end-to-end training time and is usually acceptable. If one wishes to reduce this overhead, one can consider alternating updates of \boldsymbol{L} and \boldsymbol{R} to amortize the computation.
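One possible reading of the alternating scheme is to update only \boldsymbol{L} on even steps and only \boldsymbol{R} on odd steps, so that each step computes a single \mathop{\text{msign}}. The sketch below assumes this even/odd schedule, which the text does not specify, and the name `muonr_alternating_step` is hypothetical. Each one-sided update is still an orthogonal rotation, so the singular values remain unchanged.

```python
import numpy as np

def msign(x, tol=1e-8):
    """Matrix sign via SVD, ignoring (numerically) zero singular values."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    keep = s > tol * s.max()
    return u[:, keep] @ vt[keep, :]

def skew(x):
    return (x - x.T) / 2

def rotation(o, eta):
    """Orthogonal factor (I - eta O)(I - O^T O + O^T O / sqrt(1 + eta^2))."""
    k = o.shape[0]
    p = o.T @ o
    return (np.eye(k) - eta * o) @ (np.eye(k) - p + p / np.sqrt(1 + eta**2))

def muonr_alternating_step(w, m, eta, step):
    """Assumed schedule: rotate on the left on even steps, on the right on odd steps."""
    if step % 2 == 0:
        return rotation(msign(skew(m @ w.T)), eta) @ w   # update L only
    return w @ rotation(msign(skew(w.T @ m)), eta)       # update R only

rng = np.random.default_rng(0)
w = rng.standard_normal((6, 4))
sv_before = np.linalg.svd(w, compute_uv=False)

for step in range(10):
    m = rng.standard_normal(w.shape)   # random stand-in for the momentum
    w = muonr_alternating_step(w, m, eta=0.05, step=step)

sv_drift = np.abs(np.linalg.svd(w, compute_uv=False) - sv_before).max()
print("largest change in any singular value:", sv_drift)
```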

In fact, the biggest problem with MuonR is that it keeps all singular values of the matrix unchanged from start to finish, which means we must determine all singular values of the parameters at initialization. This is not easy, because matrices at different positions may require different scales, and forcing them to the same set of values is likely suboptimal.

One feasible idea is to start from a suitable random initialization and add an element-wise scaling vector before or after each matrix to restore the degree of freedom in scale. For matrices immediately following RMSNorm, the built-in gamma parameter of RMSNorm already plays this role, so this operation can be omitted for those matrices.

As for how to choose the initial singular values of the matrix, we can consider conventional random initialization, or we can construct them according to Zipf’s law. Furthermore, we can also try to adjust the singular value entropy to the optimal entropy calculated in “Is Higher Singular Value Entropy of Matrix Parameters Always Better?” to achieve better results.
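To make the Zipf option concrete, here is a minimal sketch that builds a matrix with prescribed singular values \sigma_k \propto k^{-\alpha} between random orthonormal factors. The exponent `alpha`, the overall `scale`, and the function name `zipf_init` are illustrative assumptions. The text does not prescribe particular values.

```python
import numpy as np

def zipf_init(n, m, alpha=1.0, scale=1.0, seed=0):
    """Return W = U diag(s) V^T with s_k = scale * k^(-alpha), plus s itself."""
    rng = np.random.default_rng(seed)
    k = min(n, m)
    s = scale * np.arange(1, k + 1, dtype=float) ** (-alpha)
    # Reduced QR of Gaussian matrices gives random orthonormal columns.
    u, _ = np.linalg.qr(rng.standard_normal((n, k)))
    v, _ = np.linalg.qr(rng.standard_normal((m, k)))
    return (u * s) @ v.T, s   # (u * s) scales column j of u by s[j]

w, s = zipf_init(16, 8, alpha=1.0, scale=1.0)
sv = np.linalg.svd(w, compute_uv=False)
print("prescribed:", np.round(s, 4))
print("recovered: ", np.round(sv, 4))
```

Since MuonR never changes these values afterwards, whatever spectrum is chosen here is the spectrum the matrix keeps for the whole of training.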

Of course, if we can indeed determine the singular values of the matrix in advance — for example, if we expect the parameters at a certain location to always remain orthogonal — then we don’t need to consider these and can directly apply MuonR.

Midway Switching

Another optional approach is “midway switching”, using MuonR only as a means of “stabilization”.

Specifically, we start with conventional Muon and monitor the spectral norm / Frobenius norm of the matrix. Once the norm of the matrix exceeds the desired range, we switch to MuonR. Since both Muon variants depend only on the same gradient/momentum and differ only in computation, such switching is allowed. MuonR does not change the singular values of the matrix, so the spectral norm / Frobenius norm will no longer increase, making it a suitable “stabilization” tool.
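The switching logic can be sketched as a one-way flag driven by the monitored norm. Everything specific here is an assumption made for illustration: the threshold `max_spectral_norm`, the helper `run_with_switch`, the use of the spectral norm as the monitored quantity, and a fixed gradient that makes plain Muon grow the matrix steadily. Learning-rate alignment at the switch is deliberately left out of this sketch.

```python
import numpy as np

def msign(x, tol=1e-8):
    """Matrix sign via SVD, ignoring (numerically) zero singular values."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    keep = s > tol * s.max()
    return u[:, keep] @ vt[keep, :]

def skew(x):
    return (x - x.T) / 2

def rotation(o, eta):
    k = o.shape[0]
    p = o.T @ o
    return (np.eye(k) - eta * o) @ (np.eye(k) - p + p / np.sqrt(1 + eta**2))

def muonr_step(w, m, eta):
    return rotation(msign(skew(m @ w.T)), eta) @ w @ rotation(msign(skew(w.T @ m)), eta)

def run_with_switch(w, grads, eta, max_spectral_norm):
    """Use plain Muon until the spectral norm exceeds the threshold, then MuonR."""
    use_muonr = False
    history = []
    for g in grads:
        if not use_muonr and np.linalg.norm(w, 2) > max_spectral_norm:
            use_muonr = True                      # one-way switch
        w = muonr_step(w, g, eta) if use_muonr else w - eta * msign(g)
        history.append((use_muonr, np.linalg.norm(w, 2)))
    return w, history

rng = np.random.default_rng(0)
w0 = 0.01 * rng.standard_normal((6, 4))
g = rng.standard_normal((6, 4))   # fixed gradient: plain Muon keeps growing W
w, history = run_with_switch(w0, [g] * 30, eta=0.1, max_spectral_norm=1.0)

norms_after_switch = [nrm for flag, nrm in history if flag]
print("steps run with MuonR:", len(norms_after_switch))
print("spread of spectral norm after the switch:",
      max(norms_after_switch) - min(norms_after_switch))
```

After the switch the recorded spectral norm should stay constant up to rounding, which is the “stabilization” behavior described above.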

However, it is important to align the update magnitudes before and after the switch as much as possible to avoid introducing “sudden changes”. To this end, we consider the first-order approximation of MuonR:

\boldsymbol{L}\boldsymbol{W}\boldsymbol{R} \approx (\boldsymbol{I} - \eta\boldsymbol{O}_L) \boldsymbol{W} (\boldsymbol{I} - \eta\boldsymbol{O}_R) \approx \boldsymbol{W} - \eta(\boldsymbol{O}_L \boldsymbol{W} + \boldsymbol{W} \boldsymbol{O}_R)

Since the singular values of \boldsymbol{O}_L and \boldsymbol{O}_R do not exceed 1 (note that we cannot guarantee that [\boldsymbol{G}\boldsymbol{W}^{\top}]_{\text{skew}} and [\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}} are full rank, so we cannot directly use the orthogonality of \boldsymbol{O}_L,\boldsymbol{O}_R), we have \Vert\boldsymbol{O}_L \boldsymbol{W}\Vert_F \leq \Vert \boldsymbol{W}\Vert_F and \Vert \boldsymbol{W}\boldsymbol{O}_R\Vert_F\leq \Vert\boldsymbol{W}\Vert_F, thus

\Vert\boldsymbol{O}_L \boldsymbol{W} + \boldsymbol{W} \boldsymbol{O}_R\Vert_F \leq \Vert\boldsymbol{O}_L \boldsymbol{W}\Vert_F + \Vert\boldsymbol{W} \boldsymbol{O}_R\Vert_F \leq 2\Vert\boldsymbol{W}\Vert_F

Conventional Muon is \boldsymbol{W} - \eta \mathop{\text{msign}}(\boldsymbol{G}), and the Frobenius norm of \mathop{\text{msign}}(\boldsymbol{G}) is generally \sqrt{\min(n,m)}. Therefore, to align the Frobenius norm of the update, when switching from Muon to MuonR, the learning rate should roughly be multiplied by

\frac{\sqrt{\min(n,m)}}{2\Vert\boldsymbol{W}\Vert_F}

In practice, the first inequality above may not be tight; \boldsymbol{O}_L \boldsymbol{W} and \boldsymbol{W} \boldsymbol{O}_R are more often nearly orthogonal. According to the Pythagorean theorem, the result should be approximately \sqrt{2}\Vert\boldsymbol{W}\Vert_F, so the multiplier should be multiplied by an additional \sqrt{2}. However, considering that the difference between \sqrt{2} and 1 is not particularly large, and to ensure usability in extreme cases, it is recommended to keep the above form.
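The alignment rule can be checked numerically. The sketch below is illustrative only: it computes the multiplier \sqrt{\min(n,m)}/(2\Vert\boldsymbol{W}\Vert_F) and compares the Frobenius norm of a plain Muon update with that of a MuonR update at the rescaled learning rate. The random \boldsymbol{W} and \boldsymbol{G}, the SVD-based \mathop{\text{msign}}, and the helper name `muonr_lr_multiplier` are assumptions.

```python
import numpy as np

def msign(x, tol=1e-8):
    """Matrix sign via SVD, ignoring (numerically) zero singular values."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    keep = s > tol * s.max()
    return u[:, keep] @ vt[keep, :]

def skew(x):
    return (x - x.T) / 2

def rotation(o, eta):
    k = o.shape[0]
    p = o.T @ o
    return (np.eye(k) - eta * o) @ (np.eye(k) - p + p / np.sqrt(1 + eta**2))

def muonr_lr_multiplier(w):
    """Factor sqrt(min(n, m)) / (2 ||W||_F) applied to eta when switching to MuonR."""
    return np.sqrt(min(w.shape)) / (2 * np.linalg.norm(w))

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 5))
g = rng.standard_normal((8, 5))
eta = 0.01

# Plain Muon update and its Frobenius norm, eta * sqrt(min(n, m)) for full-rank G.
muon_norm = np.linalg.norm(eta * msign(g))

# MuonR update at the rescaled learning rate.
eta_r = eta * muonr_lr_multiplier(w)
o_l = msign(skew(g @ w.T))
o_r = msign(skew(w.T @ g))
first_order = o_l @ w + w @ o_r
muonr_norm = np.linalg.norm(rotation(o_l, eta_r) @ w @ rotation(o_r, eta_r) - w)

ratio = muonr_norm / muon_norm
print("||O_L W + W O_R||_F / ||W||_F:", np.linalg.norm(first_order) / np.linalg.norm(w))
print("MuonR / Muon update norm ratio:", ratio)
```

A ratio below 1 means the bound of 2\Vert\boldsymbol{W}\Vert_F is not tight for this draw, so the rescaled MuonR step is somewhat smaller than the Muon step it replaces, which errs on the conservative side at the switch.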

Comparison and Analysis

At the beginning of this article, it was clearly stated that MuonR is inspired by Pion. Let us first examine their connections and differences.

First, the idea of restricting the update rule to the form of left and right multiplication by orthogonal matrices (double rotation) mainly comes from Pion. After determining this update form, obtaining the corresponding gradients through “instantaneous reparameterization” is a relatively natural step. Then, Pion and MuonR begin to “diverge”:

1. Pion achieves orthogonality through the matrix exponential \exp(\text{skew-symmetric matrix}), and in actual computation, it approximates by expanding to second order;
2. Pion follows the Adam approach and maintains separate moving averages of the gradients of \boldsymbol{L} and \boldsymbol{R}, which results in four sets of cached variables;
3. MuonR follows the Muon approach and, like Muon, only caches momentum, allowing us to switch to and from Muon at any time;
4. MuonR is based on the analytical solution of steepest descent on the orthogonal manifold, requiring only a few extra steps to accurately achieve orthogonality.
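To see what the difference in points 1 and 4 amounts to, here is a short worked calculation. It assumes the second-order truncation takes the form \boldsymbol{I} + \boldsymbol{A} + \boldsymbol{A}^2/2 with \boldsymbol{A}^{\top} = -\boldsymbol{A}. The exact form used in Pion is not given here, so this is only an illustration. The second line uses \boldsymbol{O}^{\top} = -\boldsymbol{O} (the \mathop{\text{msign}} of a skew-symmetric matrix is skew-symmetric) and writes \boldsymbol{P} = \boldsymbol{O}^{\top}\boldsymbol{O}, which satisfies \boldsymbol{P}^2 = \boldsymbol{P}.

```latex
\begin{aligned}
\left(\boldsymbol{I} + \boldsymbol{A} + \tfrac{1}{2}\boldsymbol{A}^2\right)^{\top}
\left(\boldsymbol{I} + \boldsymbol{A} + \tfrac{1}{2}\boldsymbol{A}^2\right)
&= \left(\boldsymbol{I} + \tfrac{1}{2}\boldsymbol{A}^2\right)^2 - \boldsymbol{A}^2
 = \boldsymbol{I} + \tfrac{1}{4}\boldsymbol{A}^4 \\
\left(\boldsymbol{I} - \boldsymbol{P} + \tfrac{\boldsymbol{P}}{\sqrt{1+\eta^2}}\right)
(\boldsymbol{I} - \eta\boldsymbol{O})^{\top}(\boldsymbol{I} - \eta\boldsymbol{O})
\left(\boldsymbol{I} - \boldsymbol{P} + \tfrac{\boldsymbol{P}}{\sqrt{1+\eta^2}}\right)
&= \left(\boldsymbol{I} - \boldsymbol{P} + \tfrac{\boldsymbol{P}}{1+\eta^2}\right)
   \left(\boldsymbol{I} + \eta^2\boldsymbol{P}\right)
 = \boldsymbol{I}
\end{aligned}
```

Under this assumption the truncated exponential deviates from orthogonality by \boldsymbol{A}^4/4, a fourth-order term in the step size, whereas the MuonR factor is orthogonal exactly.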

Overall, Pion’s orthogonality design is relatively empirical, and the four sets of cached variables can be somewhat “daunting”; MuonR, on the other hand, is a relatively natural product of a series of works on Muon and steepest descent on orthogonal manifolds. The author believes that it is more in line with first principles overall.

Conclusion

This article proposes MuonR, a Muon variant that constrains the update to left and right multiplication by rotation (orthogonal) matrices. It keeps the singular value distribution of the matrix unchanged and offers a concise scheme for maintaining training stability.

For reprinting, please include the address of this article: https://kexue.fm/archives/11777
