This article continues our series on constrained optimization. In the previous post, "Steepest Descent on Manifolds: 1. SGD + Hypersphere", we revisited the "principle of least action" for optimizers, proposing that the core difference between optimizers lies in the constraint each one imposes on the update magnitude. If this constraint is the Euclidean norm, the corresponding steepest descent is SGD. We then discussed what happens after adding a norm constraint to the parameters themselves, which constitutes steepest descent on a hypersphere manifold.
However, the previous post was merely a "warm-up", because it dealt with the relatively simple case of vector parameters. This article enters more challenging territory: the parameters being optimized change from vectors to matrices, and the constraint on the increment changes to the spectral norm, which leads to the Muon optimizer. We then add an orthogonality constraint to the parameters, which yields the Muon optimizer on an orthogonal manifold.
Problem Description
Let the parameter being optimized be a matrix \boldsymbol{W}\in\mathbb{R}^{n\times m}, and without loss of generality let n\geq m. According to the "principle of least action" from the previous article, the steepest descent increment \Delta\boldsymbol{W} should satisfy: \begin{equation} \min_{\Delta \boldsymbol{W}} \mathcal{L}(\boldsymbol{W} +\Delta\boldsymbol{W}) \qquad \text{s.t.}\qquad \rho(\Delta\boldsymbol{W})\leq \eta \end{equation} If \rho is taken as the Frobenius norm (F-norm), we get the same result as in the previous article, because the F-norm flattens the matrix into a vector and computes that vector's L2 norm, so the result is again SGD with the matrix treated as a vector. To obtain results that better reveal and fit the essence of matrices, the norm we choose here is the spectral norm, also known as the "2-norm", denoted \|\cdot\|_2.
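To make the distinction concrete, here is a minimal numpy sketch (our own illustration, not from the original post): the F-norm of \boldsymbol{W} equals the L2 norm of its flattened vector, while the spectral norm is the largest singular value.

```python
import numpy as np

# Minimal check: the F-norm of W equals the L2 norm of the flattened vector,
# while the spectral norm is the largest singular value.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))

fro = np.linalg.norm(W, 'fro')
spec = np.linalg.norm(W, 2)                        # largest singular value
assert np.isclose(fro, np.linalg.norm(W.ravel()))
assert np.isclose(spec, np.linalg.svd(W, compute_uv=False)[0])
print(fro, spec)                                   # spec <= fro always
```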
As for why we choose the spectral norm, you can refer to "Muon Optimizer Appreciation: The Essential Leap from Vector to Matrix", "Muon Sequel: Why We Choose to Try Muon?", and "Higher-order MuP: Simpler but Smarter Spectral Condition Scaling". We will not repeat the introduction here. Simply put, the spectral norm is the tightest norm for characterizing how a linear layer's output changes, which makes it the more suitable measure of a matrix's "stability".
Following the previous steps, a first-order approximation of \mathcal{L}(\boldsymbol{W} +\Delta\boldsymbol{W}) gives \mathcal{L}(\boldsymbol{W}) + \langle \boldsymbol{G}, \Delta\boldsymbol{W}\rangle_F, where \boldsymbol{G}=\nabla_{\boldsymbol{W}}\mathcal{L}(\boldsymbol{W}). Here \langle\cdot,\cdot\rangle_F is the inner product obtained by flattening the two matrices into vectors, which also equals \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\Delta\boldsymbol{W}). Then, setting \Delta\boldsymbol{W} = -\eta \boldsymbol{\Phi}, the original problem simplifies to: \begin{equation} \max_{\boldsymbol{\Phi}} \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \|\boldsymbol{\Phi}\|_2 = 1 \label{eq:muon-obj} \end{equation} Up to this point, these transformation steps are general; if you have forgotten the details, please refer back to the previous article.
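For readers who like to verify such identities numerically, here is a quick numpy check (ours, purely illustrative) of \langle \boldsymbol{G}, \Delta\boldsymbol{W}\rangle_F = \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\Delta\boldsymbol{W}):

```python
import numpy as np

# Check that the flattened inner product equals the trace form.
rng = np.random.default_rng(1)
G = rng.standard_normal((5, 3))
dW = rng.standard_normal((5, 3))

assert np.isclose(G.ravel() @ dW.ravel(), np.trace(G.T @ dW))
```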
Basic Results
The solution of objective [eq:muon-obj] was given in the "Matrix Norm" section of "Muon Optimizer Appreciation: The Essential Leap from Vector to Matrix", but for completeness we repeat it here. Let the SVD of \boldsymbol{G} be \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top} = \sum_{i=1}^r \sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^{\top}, where r is the rank of \boldsymbol{G}. We have: \begin{equation} \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi})=\mathop{\mathrm{tr}}\left(\sum_{i=1}^r \sigma_i \boldsymbol{v}_i \boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\right) = \sum_{i=1}^r \sigma_i \boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\boldsymbol{v}_i \end{equation} By definition, when \|\boldsymbol{\Phi}\|_2=1, \|\boldsymbol{\Phi}\boldsymbol{v}_i\|_2\leq \|\boldsymbol{v}_i\|_2=1, so \boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\boldsymbol{v}_i\leq 1. Therefore: \begin{equation} \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi})\leq \sum_{i=1}^r \sigma_i = \|\boldsymbol{G}\|_* \end{equation} Here \|\cdot\|_* is the Nuclear Norm of the matrix. Equality holds when every \boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\boldsymbol{v}_i equals 1, at which point: \begin{equation} \boldsymbol{\Phi} = \sum_{i=1}^r \boldsymbol{u}_i \boldsymbol{v}_i^{\top} = \boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top} = \mathop{\mathrm{msign}}(\boldsymbol{G}) \end{equation} Note that if r < m, adding \boldsymbol{u}_{r+1} \boldsymbol{v}_{r+1}^{\top}, \boldsymbol{u}_{r+2} \boldsymbol{v}_{r+2}^{\top}, \dots also preserves the equality, so the solution is not unique. However, since the components beyond rank r cannot be determined from \boldsymbol{G}, the expression above can be regarded as the deterministic, minimal solution. Additionally, interested readers can try the heavier machinery of Von Neumann's trace inequality to find the general solution under the Schatten-p norm, of which the spectral norm is the special case p\to\infty.
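The result is easy to verify numerically. Below is a direct SVD-based sketch of \mathop{\mathrm{msign}} (the function name and rank tolerance are our own choices), checking that \boldsymbol{\Phi} has unit spectral norm and attains the nuclear-norm bound:

```python
import numpy as np

def msign(G, eps=1e-12):
    """msign(G) = U_{[:,:r]} V_{[:,:r]}^T, computed from the SVD of G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    r = int(np.sum(s > eps * s[0]))                # numerical rank
    return U[:, :r] @ Vt[:r, :]

rng = np.random.default_rng(2)
G = rng.standard_normal((6, 4))
Phi = msign(G)

# Phi has unit spectral norm and attains the nuclear-norm upper bound.
assert np.isclose(np.linalg.norm(Phi, 2), 1.0)
assert np.isclose(np.trace(G.T @ Phi),
                  np.linalg.svd(G, compute_uv=False).sum())
```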
Orthogonal Manifold
So far, we have proved that for matrix parameters, the direction of steepest descent under the spectral norm constraint is not the negative gradient direction -\boldsymbol{G}, but rather requires an additional \mathop{\mathrm{msign}} operator, i.e., -\mathop{\mathrm{msign}}(\boldsymbol{G}). This is exactly the Muon optimizer used to train Kimi K2, which is currently one of the most competitive optimizers. This in turn suggests that the spectral norm is very appropriate as a stability constraint for matrices.
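As a side note, practical Muon implementations approximate \mathop{\mathrm{msign}} with a Newton-Schulz iteration instead of an SVD (and apply it to a momentum buffer, which we omit here). The sketch below (our own illustration) uses the classical cubic iteration \boldsymbol{X} \leftarrow 1.5\boldsymbol{X} - 0.5\boldsymbol{X}\boldsymbol{X}^{\top}\boldsymbol{X} as a stand-in; real Muon code uses specially tuned quintic coefficients.

```python
import numpy as np

def msign_newton_schulz(G, steps=50):
    # Classical cubic Newton-Schulz iteration for the polar factor.
    # Normalizing by the F-norm puts all singular values in (0, 1],
    # inside the iteration's convergence region.
    X = G / (np.linalg.norm(G, 'fro') + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(W, G, lr=1e-2):
    # Steepest descent under the spectral norm constraint:
    # W <- W - lr * msign(G)
    return W - lr * msign_newton_schulz(G)
```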
Of course, the results so far are old. Now we explore something new: adding an orthogonality constraint \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I} to the parameter \boldsymbol{W} (see "Orthogonal manifold"). Two cases arise. First, n=m, where \boldsymbol{W} is a proper orthogonal matrix satisfying \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{W}\boldsymbol{W}^{\top}=\boldsymbol{I}. Second, n > m, where \boldsymbol{W}\boldsymbol{W}^{\top}=\boldsymbol{I} cannot hold; such a \boldsymbol{W} is usually called a semi-orthogonal matrix, and the corresponding space is called a Stiefel manifold.
Specifically, the problem we want to solve now is: \begin{equation} \max_{\boldsymbol{\Phi}} \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \|\boldsymbol{\Phi}\|_2 = 1,\,\, \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I},\,\,(\boldsymbol{W} - \eta \boldsymbol{\Phi})^{\top}(\boldsymbol{W} - \eta \boldsymbol{\Phi})=\boldsymbol{I} \end{equation} Still adhering to the principle that "first-order approximation is sufficient": expanding the last condition gives \boldsymbol{W}^{\top}\boldsymbol{W} - \eta(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W}) + \eta^2\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}=\boldsymbol{I}, and discarding the \mathcal{O}(\eta^2) term simplifies it to \boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} = \boldsymbol{0}, meaning \boldsymbol{W}^{\top}\boldsymbol{\Phi} is a skew-symmetric matrix: \begin{equation} \max_{\boldsymbol{\Phi}} \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \|\boldsymbol{\Phi}\|_2 = 1,\,\, \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I},\,\,\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} = \boldsymbol{0} \label{eq:muon-obj-orth} \end{equation}
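A small numpy sketch (our own check, not from the post) makes the first-order claim visible: for a direction \boldsymbol{\Phi} with skew-symmetric \boldsymbol{W}^{\top}\boldsymbol{\Phi}, the orthogonality defect of \boldsymbol{W} - \eta\boldsymbol{\Phi} shrinks like \eta^2.

```python
import numpy as np

# With a "tangent coordinate" A = W^T Phi that is skew-symmetric,
# the orthogonality defect of W - eta*Phi is O(eta^2).
rng = np.random.default_rng(3)
W, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random orthogonal W
M = rng.standard_normal((5, 5))
A = (M - M.T) / 2                                  # skew-symmetric
Phi = W @ A                                        # so W^T Phi = A

for eta in (1e-1, 1e-2, 1e-3):
    Wn = W - eta * Phi
    print(eta, np.linalg.norm(Wn.T @ Wn - np.eye(5)))  # shrinks like eta**2
```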
When are orthogonal constraints used? There are in fact many scenarios. For example, in classification problems, if the categories are known to have little correlation with one another, one can consider imposing orthogonality constraints on the class matrix; most of the time, however, we settle for approximate orthogonality by adding a regularization term \|\boldsymbol{W}^{\top}\boldsymbol{W}-\boldsymbol{I}\|_F^2 to the model. Another example is LoRA, where the \boldsymbol{A}\boldsymbol{B} parameterization is redundant, and the redundancy can be reduced through orthogonality constraints (reference), and so on.
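For concreteness, the soft-constraint route just mentioned takes only a few lines; this numpy sketch (function names are ours) gives the penalty and its analytic gradient:

```python
import numpy as np

def orth_penalty(W):
    # ||W^T W - I||_F^2
    D = W.T @ W - np.eye(W.shape[1])
    return np.sum(D * D)

def orth_penalty_grad(W):
    # Gradient: d/dW ||W^T W - I||_F^2 = 4 W (W^T W - I)
    return 4.0 * W @ (W.T @ W - np.eye(W.shape[1]))
```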
Solution Process
To solve the objective [eq:muon-obj-orth], similar to the previous article, we introduce a Lagrange multiplier matrix \boldsymbol{\Lambda}\in\mathbb{R}^{m\times m}, obtaining: \begin{equation} \begin{aligned} \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) =&\, \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \mathop{\mathrm{tr}}(\boldsymbol{\Lambda}^{\top}(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W})) \\ =&\, \mathop{\mathrm{tr}}((\boldsymbol{G} + \boldsymbol{W}(\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top}))^{\top}\boldsymbol{\Phi}) \\ \leq &\,\|\boldsymbol{G} + \boldsymbol{W}(\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top})\|_* \end{aligned} \end{equation} The second equality uses the trace identities \mathop{\mathrm{tr}}(\boldsymbol{A}\boldsymbol{B}) = \mathop{\mathrm{tr}}(\boldsymbol{B}\boldsymbol{A}) and \mathop{\mathrm{tr}}(\boldsymbol{A}\boldsymbol{B}) = \mathop{\mathrm{tr}}(\boldsymbol{B}^{\top}\boldsymbol{A}^{\top}). According to the previous Muon results, the condition for equality is: \begin{equation} \boldsymbol{\Phi} = \mathop{\mathrm{msign}}(\boldsymbol{G} + \boldsymbol{W}(\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top})) \end{equation} The remaining problem is to find a real symmetric matrix \boldsymbol{X} = \boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top} such that \boldsymbol{W}^{\top}\boldsymbol{\Phi} is skew-symmetric. This is easy to solve for n=m, because then \boldsymbol{W}^{\top} is orthogonal and can be absorbed into the \mathop{\mathrm{msign}} operator (an orthogonal factor merely rotates the singular vectors, so \mathop{\mathrm{msign}}(\boldsymbol{Q}\boldsymbol{M}) = \boldsymbol{Q}\mathop{\mathrm{msign}}(\boldsymbol{M}) for any orthogonal \boldsymbol{Q}): \begin{equation} \boldsymbol{W}^{\top}\boldsymbol{\Phi} = \boldsymbol{W}^{\top}\mathop{\mathrm{msign}}(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X}) = \mathop{\mathrm{msign}}(\boldsymbol{W}^{\top}(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X})) = \mathop{\mathrm{msign}}(\boldsymbol{W}^{\top}\boldsymbol{G} +\boldsymbol{X}) \end{equation} Note that \mathop{\mathrm{msign}} also preserves skew-symmetry; that is, if a square matrix \boldsymbol{M} is skew-symmetric, then \mathop{\mathrm{msign}}(\boldsymbol{M}) is also skew-symmetric. Thus, to make \boldsymbol{W}^{\top}\boldsymbol{\Phi} skew-symmetric, we only need to make \boldsymbol{W}^{\top}\boldsymbol{G} + \boldsymbol{X} skew-symmetric. Since \boldsymbol{X} is symmetric, this is equivalent to decomposing \boldsymbol{W}^{\top}\boldsymbol{G} into the sum of a symmetric matrix and a skew-symmetric matrix: \begin{equation} \boldsymbol{W}^{\top}\boldsymbol{G} = \underbrace{\frac{1}{2}(\boldsymbol{W}^{\top}\boldsymbol{G} + \boldsymbol{G}^{\top}\boldsymbol{W})}_{[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}} + \underbrace{\frac{1}{2}(\boldsymbol{W}^{\top}\boldsymbol{G} - \boldsymbol{G}^{\top}\boldsymbol{W})}_{[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}} \end{equation} where [\boldsymbol{M}]_{\text{sym}} = (\boldsymbol{M}+\boldsymbol{M}^{\top})/2 and [\boldsymbol{M}]_{\text{skew}} = (\boldsymbol{M}-\boldsymbol{M}^{\top})/2. Based on this, we conclude \boldsymbol{X} = -[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}. The solution for n > m is more complex and will be discussed in the next article.
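Before moving on, we can check the n=m solution numerically. The sketch below (our own verification code) builds a random orthogonal \boldsymbol{W}, forms \boldsymbol{\Phi} = \mathop{\mathrm{msign}}(\boldsymbol{G} - \boldsymbol{W}[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}), and confirms that \boldsymbol{W}^{\top}\boldsymbol{\Phi} is skew-symmetric and \|\boldsymbol{\Phi}\|_2 = 1:

```python
import numpy as np

def msign(G, eps=1e-12):
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    r = int(np.sum(s > eps * s[0]))
    return U[:, :r] @ Vt[:r, :]

rng = np.random.default_rng(4)
W, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # orthogonal W (n = m)
G = rng.standard_normal((5, 5))

sym = (W.T @ G + G.T @ W) / 2                      # [W^T G]_sym
Phi = msign(G - W @ sym)                           # X = -[W^T G]_sym

assert np.allclose(W.T @ Phi, -(W.T @ Phi).T)      # W^T Phi is skew-symmetric
assert np.isclose(np.linalg.norm(Phi, 2), 1.0)     # unit spectral norm
```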
Retraction Operation
In summary, for n=m, our final result is: \begin{equation} \begin{aligned} \boldsymbol{\Phi} =&\, \mathop{\mathrm{msign}}(\boldsymbol{G} - \boldsymbol{W}[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}) \\ =&\, \boldsymbol{W}\mathop{\mathrm{msign}}([\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}) \end{aligned} \end{equation} where the second line uses \boldsymbol{G} - \boldsymbol{W}[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}} = \boldsymbol{W}[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}} together with \mathop{\mathrm{msign}}(\boldsymbol{W}\boldsymbol{M}) = \boldsymbol{W}\mathop{\mathrm{msign}}(\boldsymbol{M}) for orthogonal \boldsymbol{W}. Therefore, the updated parameter is: \begin{equation} \boldsymbol{W} - \eta \boldsymbol{\Phi} = \boldsymbol{W}(\boldsymbol{I} - \eta\,\underbrace{\mathop{\mathrm{msign}}([\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}})}_{\text{denoted as } \boldsymbol{O}}) \label{eq:updated-W-final} \end{equation} This preserves orthogonality only up to \mathcal{O}(\eta^2): since \boldsymbol{O} is skew-symmetric, the first-order terms cancel in the check \begin{equation} \begin{aligned} (\boldsymbol{I} - \eta\boldsymbol{O})^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}(\boldsymbol{I} - \eta\boldsymbol{O}) =&\,(\boldsymbol{I} - \eta\boldsymbol{O})^{\top}(\boldsymbol{I} - \eta\boldsymbol{O}) \\ =&\,\boldsymbol{I} + \eta^2\boldsymbol{O}^{\top}\boldsymbol{O} \end{aligned} \label{eq:orth-check-final} \end{equation} To restore exact orthogonality, the complete update rule applies a retraction: \begin{equation} \boldsymbol{W} \leftarrow \mathop{\mathrm{msign}}(\boldsymbol{W} - \eta \boldsymbol{\Phi}) = \boldsymbol{W}\mathop{\mathrm{msign}}(\boldsymbol{I} - \eta\boldsymbol{O}) \end{equation} Since \boldsymbol{O}^{\top}\boldsymbol{O} is an orthogonal projection, i.e., (\boldsymbol{O}^{\top}\boldsymbol{O})^2 = \boldsymbol{O}^{\top}\boldsymbol{O}, the \mathop{\mathrm{msign}} operation admits the closed form: \begin{equation} \boldsymbol{W} \leftarrow \boldsymbol{W}(\boldsymbol{I} - \eta\boldsymbol{O})\left(\boldsymbol{I} - \boldsymbol{O}^{\top}\boldsymbol{O} + \frac{\boldsymbol{O}^{\top}\boldsymbol{O}}{\sqrt{1+\eta^2}}\right) \end{equation}
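Putting everything together, here is an end-to-end sketch of the n=m update (our own illustrative implementation, with names of our choosing), verifying that the retraction keeps \boldsymbol{W} orthogonal to machine precision:

```python
import numpy as np

def msign(G, eps=1e-12):
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    r = int(np.sum(s > eps * s[0]))
    return U[:, :r] @ Vt[:r, :]

def orthogonal_muon_step(W, G, eta=0.1):
    skew = (W.T @ G - G.T @ W) / 2                 # [W^T G]_skew
    O = msign(skew)
    P = O.T @ O                                    # projection: (O^T O)^2 = O^T O
    I = np.eye(W.shape[1])
    retraction = I - P + P / np.sqrt(1 + eta**2)   # closed form of the msign correction
    return W @ (I - eta * O) @ retraction

rng = np.random.default_rng(5)
W, _ = np.linalg.qr(rng.standard_normal((6, 6)))
G = rng.standard_normal((6, 6))
W_new = orthogonal_muon_step(W, G)
print(np.linalg.norm(W_new.T @ W_new - np.eye(6)))  # ~1e-15: exactly orthogonal
```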
Summary
In this article, we re-derived the Muon optimizer as steepest descent under a spectral norm constraint, then explored its form under an orthogonality constraint on the parameters, obtaining, for n=m, an update rule that keeps the parameters exactly orthogonal throughout optimization. The n > m (Stiefel manifold) case is left to the next article.
Original address: https://kexue.fm/archives/11215