Steepest Descent on Manifolds: 4. Muon + Spectral Sphere · English (unofficial) translations of posts at kexue.fm

Readers who have finished the first three articles should already be familiar with the “routine” of this series: first, propose constraints on the update amount to find the steepest descent direction, then add constraints to the parameters themselves to find a new steepest descent direction. When solving the parameter-constrained problem, we adopt the principle that a “first-order approximation is sufficient” to simplify the constraint form, which geometrically corresponds to the “tangent space.” Then, we use the method of undetermined coefficients to transform the problem into an unconstrained form to write an analytical solution, and finally solve for the undetermined coefficients numerically.

In this article, we will solve a new example—Muon under the spectral sphere constraint. This is an analogous generalization of the first article, “Steepest Descent on Manifolds: 1. SGD + Hypersphere.” It can be considered when we want the spectral norm of the parameters to remain constant. Of course, it can also be treated simply as a practice exercise.

Problem Description

In “Steepest Descent on Manifolds: 2. Muon + Orthogonality” and “Steepest Descent on Manifolds: 3. Muon + Stiefel,” we discussed the collision between Muon and orthogonality constraints in detail, so we will not expand on the background here and will directly give the problem form: \begin{equation} \max_{\boldsymbol{\Phi}} \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \|\boldsymbol{\Phi}\|_2 = 1,\,\, \|\boldsymbol{W}\|_2 = 1,\,\,\|\boldsymbol{W} - \eta \boldsymbol{\Phi}\|_2=1 \end{equation} where \boldsymbol{W},\boldsymbol{\Phi}\in\mathbb{R}^{n\times m}(n \geq m), and \|\cdot\|_2 is the spectral norm. Of course, if we are interested, the latter two spectral norms can be replaced with other norms; for example, the F-norm would represent the combination of “Muon + Hypersphere.”

The “first-order approximation is sufficient” principle requires us to find the gradient of the spectral norm, which we have already introduced in “Reflections from Spectral Norm Gradients to New Weight Decay” and “Derivatives of SVD.” The answer is \nabla_{\boldsymbol{W}}\|\boldsymbol{W}\|_2=\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}, where \boldsymbol{u}_1, \boldsymbol{v}_1 are the two singular vectors corresponding to the largest singular value of \boldsymbol{W}, which can be solved by power iteration. This result also assumes that the largest singular value is unique; the non-unique case will be discussed later.

If it were the F-norm, we would have \nabla_{\boldsymbol{W}}\|\boldsymbol{W}\|_F=\boldsymbol{W}/\|\boldsymbol{W}\|_F. In short, regardless of the norm, there exists a matrix \boldsymbol{\Theta} that depends only on \boldsymbol{W} such that \nabla_{\boldsymbol{W}}\|\boldsymbol{W}\|=\boldsymbol{\Theta}. Then, from \|\boldsymbol{W}\| = 1 and \|\boldsymbol{W} - \eta \boldsymbol{\Phi}\|=1, we can obtain its first-order approximation version 0 = \langle\boldsymbol{\Theta},\boldsymbol{\Phi}\rangle_F = \mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}). Therefore, under the first-order approximation, the general formulation of this type of problem is: \begin{equation} \max_{\boldsymbol{\Phi}} \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \|\boldsymbol{\Phi}\|_2 = 1,\,\, \mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0 \end{equation}

Undetermined Coefficients

The routine is the same. Introducing the undetermined coefficient \lambda, we have: \begin{equation} \begin{aligned} \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) =&\, \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \lambda \mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) \\ =&\, \mathop{\mathrm{tr}}((\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}\boldsymbol{\Phi}) \\ \leq &\,\|\boldsymbol{G} + \lambda\boldsymbol{\Theta}\|_* \end{aligned} \end{equation} The last inequality is the result of Muon itself, similar to the Hölder inequality for two vectors, with equality holding at: \begin{equation} \boldsymbol{\Phi} = \mathop{\mathrm{msign}}(\boldsymbol{G} + \lambda\boldsymbol{\Theta}) \end{equation} The next task is to solve for a \lambda such that it satisfies the constraint \mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0, and we are done.

Due to the presence of \mathop{\mathrm{msign}}, \mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0 is actually a nonlinear equation. The author tends to believe it has no analytical solution, so we seek a numerical solution method. However, with the experience from “Steepest Descent on Manifolds: 3. Muon + Stiefel,” we can calmly construct an iterative method for such equations.

Iterative Solution

First, according to the definition \mathop{\mathrm{msign}}(\boldsymbol{M}) = \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^{-1/2}, we can write \boldsymbol{\Phi}=(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1}, where \boldsymbol{Q}=((\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta}))^{1/2}. Then: \begin{equation} \mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0\qquad\Rightarrow\qquad \lambda = -\frac{\mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top}\boldsymbol{G}\boldsymbol{Q}^{-1})}{\mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta}\boldsymbol{Q}^{-1})} \end{equation} But note that this is not an analytical solution because \boldsymbol{Q} also depends on \lambda. However, based on the above equation, we can construct an iterative format: substitute an initial \lambda to find \boldsymbol{Q}= ((\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta}))^{1/2}, then substitute it into the above equation to update \lambda, and execute repeatedly until convergence.

However, although this iterative scheme is theoretically feasible, it requires calculating \boldsymbol{Q}^{-1}. Although we mentioned efficient algorithms for solving this in “Efficient Calculation of Matrix r-th Roots and Inverse r-th Roots,” from the perspective of Occam’s razor (“do not multiply entities”), we still hope to avoid iterations other than \mathop{\mathrm{msign}}. Therefore, the author tried to find another iterative scheme that does not require calculating \boldsymbol{Q}^{-1}. To this end, we first write: \begin{equation} \boldsymbol{\Theta}^{\top} \boldsymbol{\Phi} = \boldsymbol{\Theta}^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1} \end{equation} For our goal, the trace of the above equation is zero. We can explicitly subtract \mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m from the left side of the equation to ensure it satisfies this condition: \begin{equation} \boldsymbol{\Theta}^{\top} \boldsymbol{\Phi} - \mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1} \end{equation} At this point, multiplying both sides by \boldsymbol{Q} and then taking the trace allows us to solve for \lambda: \begin{equation} \lambda = \frac{\mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}\boldsymbol{Q}) - \mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top}\boldsymbol{\Phi}) \mathop{\mathrm{tr}}(\boldsymbol{Q})/m - \mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top}\boldsymbol{G})}{\mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta})} \end{equation} In this way, the iteration does not require calculating \boldsymbol{Q}^{-1}.

Reference Code

We use \lambda=-\mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top}\boldsymbol{G})/\mathop{\mathrm{tr}}(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta}) as the initial value. The test code is as follows:

import numpy as np

def msign(g):
    """Accurate calculation of msign using SVD"""
    u, s, vh = np.linalg.svd(g, full_matrices=False)
    return u @ np.diag(np.sign(s)) @ vh

def dot(a, b):
    """Equivalent to np.trace(a.T @ b)"""
    return (a * b).sum()

n, m = 100, 50
w = np.random.randn(n, m) / m**0.5
g = np.random.randn(n, m) / m**0.5
u, s, vh = np.linalg.svd(w, full_matrices=False)
theta = u[:, :1] @ vh[:1]

lamb = - dot(theta, g) / dot(theta, theta)
for i in range(10):
    phi = msign(z := g + lamb * theta)
    print('step:', i, ', inner product:', dot(phi, g), ', tangent error:', dot(theta, phi))
    q, x = z.T @ phi, theta.T @ phi
    lamb = (dot(x, q) - np.trace(x) * np.trace(q) / m - dot(theta, g)) / dot(theta, theta)

Other Details

As in the previous three articles, because the “first-order approximation is sufficient” principle is used, the spectral norm of \boldsymbol{W} - \eta\boldsymbol{\Phi} is accurate to 1 + \mathcal{O}(\eta^2) and usually cannot be exactly 1. Therefore, we still need to perform a Spectral Normalization: \begin{equation} \boldsymbol{W}\quad\leftarrow\quad \frac{\boldsymbol{W} - \eta\boldsymbol{\Phi}}{\|\boldsymbol{W} - \eta\boldsymbol{\Phi}\|_2} \end{equation} Fortunately, the spectral norm can be efficiently calculated through power iteration, so this is not a particularly expensive calculation (compared to the iteration of \mathop{\mathrm{msign}} itself).

Another detail worth analyzing is the case where the maximum singular value is not unique. In actual numerical calculations, this special case can be ignored, but for theoretical completeness, it should be included in the analysis. In this case, the corresponding singular vectors are also not unique, which is equivalent to saying there are multiple different tangent spaces. The actual feasible space is the intersection of these tangent spaces. Taking two maximum singular values as an example, the problem becomes: \begin{equation} \max_{\boldsymbol{\Phi}} \mathop{\mathrm{tr}}(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \|\boldsymbol{\Phi}\|_2 = 1,\,\, \mathop{\mathrm{tr}}(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})=0,\,\, \mathop{\mathrm{tr}}(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})=0 \end{equation} Here \boldsymbol{\Theta}_1=\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}, \boldsymbol{\Theta}_2=\boldsymbol{u}_2 \boldsymbol{v}_2^{\top}. Introducing two undetermined coefficients \lambda_1, \lambda_2, we can solve for: \begin{equation} \boldsymbol{\Phi} = \mathop{\mathrm{msign}}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2) \end{equation} Next, we need to solve the system of equations \mathop{\mathrm{tr}}(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})=0, \mathop{\mathrm{tr}}(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})=0. Similarly, introducing: \begin{equation} \boldsymbol{Q}=((\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2))^{1/2} = (\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)^{\top}\boldsymbol{\Phi} \end{equation} We can write the system of equations: \begin{equation} \begin{gathered} \boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi} - \mathop{\mathrm{tr}}(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}_1^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)\boldsymbol{Q}^{-1} \\ \boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi} - \mathop{\mathrm{tr}}(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}_2^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)\boldsymbol{Q}^{-1} \\ \end{gathered} \end{equation} Multiplying both sides by \boldsymbol{Q} and taking the trace transforms this into a system of two linear equations for \lambda_1, \lambda_2 that can be solved, which can then be used to construct an iterative solution format. The details will not be expanded upon here; interested readers can complete them as an exercise.

Summary

This article mainly considers the Muon form corresponding to applying spectral norm or general norm constraints to parameters. Building on the first three articles, there are no obvious technical difficulties in this piece. Readers can also view it simply as a supplementary exercise for practice.

When reposting, please include the original address of this article: https://kexue.fm/archives/11241

For more detailed reposting matters, please refer to: “Scientific Space FAQ”