How to More Scientifically Estimate the Spectral Norm of a Matrix? · English (unofficial) translations of posts at kexue.fm

Spectral Norm

The spectral norm is a core concept in matrix analysis and plays an important role in deep learning—from the Lipschitz constraints required in the era of WGANs, to the training stability of modern LLMs, and to the emerging Muon optimizer—all closely related to the spectral norm of matrix parameters. Therefore, how to estimate the spectral norm efficiently and accurately is becoming increasingly important and worthy of in-depth study.

As is well known, power iteration is the standard method for estimating the spectral norm, but there is still significant room for improvement. This article will briefly summarize some ideas for estimating the spectral norm, including improving the convergence speed of power iteration and how to estimate strict upper bounds for the spectral norm, etc.

Spectral Norm

The definition of the spectral norm is \Vert\boldsymbol{W}\Vert_2 = \max_{\Vert\boldsymbol{x}\Vert_2=1} \Vert \boldsymbol{W}\boldsymbol{x}\Vert_2 where \boldsymbol{W}\in\mathbb{R}^{n\times m}, \boldsymbol{x}\in\mathbb{R}^m, and without loss of generality, assume n\geq m. From the definition, the spectral norm represents the “expansion factor” of a linear layer—the length of the output vector after passing through the linear layer is at most \Vert\boldsymbol{W}\Vert_2 times that of the input. The spectral norm also equals the maximum singular value of the matrix (see “The Road of Low-Rank Approximation (II): SVD” for a proof). We will not elaborate further on these basics.

The author first encountered the spectral norm during the peak of GANs, when WGAN emerged and emphasized the necessity of the discriminator satisfying Lipschitz constraints. The Lipschitz constant of a linear layer \boldsymbol{W}\boldsymbol{x} is exactly the spectral norm of the matrix \boldsymbol{W}, which gave rise to techniques such as spectral normalization and spectral regularization. Interested readers can refer to “Lipschitz Constraints in Deep Learning: Generalization and Generative Models”.

In recent years, the emerging Muon optimizer is also closely related to the spectral norm, often regarded as steepest descent under the spectral norm. Meanwhile, in “Beyond MuP: 4. Maintaining Parameter Stability”, we also proposed constraining the spectral norm of parameters to ensure training stability. All these aspects demand accurate computation of the spectral norm.

Power Iteration

Computing the spectral norm via SVD is the most direct method, but it is obviously too expensive. We usually use power iteration: \boldsymbol{v}^{(t)} = \frac{\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{v}^{(t-1)}}{\Vert\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{v}^{(t-1)}\Vert_2} It converges to the right principal singular vector \boldsymbol{v}_1 at a rate of (\sigma_2/\sigma_1)^{2t}. After T iterations, we have \sigma_1 \approx \Vert\boldsymbol{W}\boldsymbol{v}^{(t)}\Vert_2 For a proof, see “Thinking from Spectral Norm Gradient to New-style Weight Decay”. In practice, if we only care about the spectral norm and not the singular vectors, the convergence is often more optimistic than (\sigma_2/\sigma_1)^{2t}. Furthermore, the result of power iteration is prone to fluctuations due to the initial value \boldsymbol{v}^{(0)}; we can consider running k parallel chains and taking the maximum.

If we compute k chains simultaneously and replace the L2 normalization at each iteration with orthogonalization of these k vectors, i.e., \boldsymbol{V}^{(t)} = \mathop{\text{QR}}(\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{V}^{(t-1)}) then the result will converge to the first k right singular vectors, from which we can obtain the first k singular values. For the principle, see “Muon Implementation Based on Streaming Power Iteration: 4. Principles”. However, we will not expand on this extension here, and will focus on estimating the spectral norm—i.e., the maximum singular value.

Computing Gradient

First, here is a reference implementation of power iteration in JAX:

import jax, jax.lax as lax
import jax.numpy as jnp

@jax.jit(static_argnums=(1,))
def l2_normalize(x, axis=-2):
    return x / jnp.linalg.vector_norm(x, axis=axis, keepdims=True)

@jax.jit(static_argnums=(1, 2, 3))
def spec_norm_v1(w, T=10, k=1, key=42):
    v_shape = w.shape[:-2] + w.shape[-1:] + (k,)
    v_init = jax.random.normal(jax.random.PRNGKey(key), v_shape)
    v_step = lambda i, v: l2_normalize(w.mT @ (w @ v))
    v = lax.fori_loop(0, T, v_step, v_init)
    return jnp.linalg.vector_norm(w @ v, axis=-2).max(axis=-1)

Clearly, the complexity of power iteration is \mathcal{O}(mnTk), which is already quite ideal in terms of complexity alone. If we only use it for monitoring, the above implementation is sufficient. However, if we need spectral normalization or spectral regularization, we need the gradient of the spectral norm. With the above approach, we would need to backpropagate through the loop of power iterations, leading to high computational complexity.

In fact, in “Thinking from Spectral Norm Gradient to New-style Weight Decay”, we already derived the gradient of the spectral norm as \nabla_{\boldsymbol{W}}\sigma_1 = \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}. Therefore, we can directly define its gradient as \nabla_{\boldsymbol{W}}\sigma_1 = \boldsymbol{u}_1 \boldsymbol{v}_1^{\top} to reduce the complexity of backpropagation. Alternatively, we can leverage \sigma_1 = \boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1 and output \sigma_1 = \textcolor{skyblue}{[}\boldsymbol{u}_1^{\top}\textcolor{skyblue}{]_{\text{sg}}}\boldsymbol{W}\textcolor{skyblue}{[}\boldsymbol{v}_1\textcolor{skyblue}{]_{\text{sg}}} where \textcolor{skyblue}{[\cdot]_{\text{sg}}} is the stop gradient operator. In this way, automatic differentiation will only differentiate with respect to \boldsymbol{W}, yielding exactly \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}, and avoid backpropagating through the inner loop of power iteration for \boldsymbol{u}_1,\boldsymbol{v}_1.

Waste Not

Through T steps of power iteration, we obtain a sequence of vectors \boldsymbol{v}^{(1)},\boldsymbol{v}^{(2)},\cdots,\boldsymbol{v}^{(T)}, but ultimately we only use \boldsymbol{v}^{(T)} to estimate \sigma_1. This seems rather “wasteful”, so some have thought of using all of them to form a better estimate of the spectral norm.

Normally, \boldsymbol{v}^{(1)},\boldsymbol{v}^{(2)},\cdots,\boldsymbol{v}^{(T)} are different from each other, but they become increasingly close to \boldsymbol{v}_1. We can imagine them as “\boldsymbol{v}_1 plus some noise”. If we put them together for “denoising”, it is indeed possible to obtain a more accurate result. Specifically, we consider building a better approximation to \boldsymbol{v}_1 through a linear combination of \boldsymbol{v}^{(1)},\boldsymbol{v}^{(2)},\cdots,\boldsymbol{v}^{(T)}, and the subspace spanned by these vectors is called the Krylov subspace.

For simplicity, we first perform an orthogonalization to obtain an equivalent orthonormal basis \boldsymbol{Q} = [\boldsymbol{q}_1,\boldsymbol{q}_2,\cdots,\boldsymbol{q}_T]\in\mathbb{R}^{m\times T}. Then we want to find coefficients \boldsymbol{x}=[x_1,x_2,\cdots,x_T]^{\top}\in\mathbb{R}^T such that the vector \sum_{i=1}^T x_i \boldsymbol{q}_i = \boldsymbol{Q}\boldsymbol{x} satisfies \boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{Q}\boldsymbol{x}\approx \sigma_1^2\boldsymbol{Q}\boldsymbol{x} Multiplying both sides by \boldsymbol{Q}^{\top} gives \boldsymbol{Q}^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{Q}\boldsymbol{x}\approx \sigma_1^2\boldsymbol{x}, which indicates that \sigma_1^2 and \boldsymbol{x} are respectively the eigenvalue and eigenvector of (\boldsymbol{W}\boldsymbol{Q})^{\top}\boldsymbol{W}\boldsymbol{Q}. This means we only need to perform an eigendecomposition of this matrix, obtain its largest eigenvalue and then take the square root to get a better estimate. Since this is only a T\times T matrix, when T is relatively small, the eigendecomposition is not expensive, so we can directly call a standard library function.

This idea is essentially a simplified version of the Lanczos algorithm used in modern SVD solvers; practical SVD implementations involve many more refined details. Moreover, this acceleration approach shares similarities with Randomized SVD, which typically uses a Gaussian random matrix for projection, whereas here we use the orthonormal basis from the Krylov subspace for projection.

Acceleration Technique

The reference code incorporating Krylov subspace acceleration for power iteration is as follows:

@jax.jit(static_argnums=(1, 2))
def spec_norm_v2(w, T=10, key=42):
    v_shape = w.shape[:-2] + w.shape[-1:] + (1,)
    v_init = jax.random.normal(jax.random.PRNGKey(key), v_shape)
    v_step = lambda v, _: (l2_normalize(w.mT @ (w @ v)),) * 2
    v = lax.scan(v_step, v_init, length=T)[1].swapaxes(0, -1)[0]
    v = jnp.linalg.qr(v)[0]
    return jnp.linalg.eigvalsh((u := w @ v).mT @ u)[..., -1]**0.5

Experiments show that the subspace acceleration is remarkably effective: at T=5, its accuracy already surpasses that of the original power iteration with T=10, and it is faster. At T=10, roughly 2/3 of the time is spent on power iterations, 1/3 on QR decomposition, while the eigendecomposition takes a very small fraction of the time.

Readers familiar with the “Streaming Power Iteration” series (see e.g., the previously referenced article) might think of using the Cholesky QR introduced there to accelerate the QR decomposition. Unfortunately, because the results of power iteration become increasingly collinear, the condition number of the matrix to be QR-factorized deteriorates steadily, which is exactly the regime where Cholesky QR “fails”—the failure rate is extremely high, so it is basically unusable for acceleration here.

A more effective acceleration strategy is to use only the results of the last few steps of power iteration, say the last three steps \boldsymbol{v}^{(T-2)},\boldsymbol{v}^{(T-1)},\boldsymbol{v}^{(T)}, to construct the Krylov subspace and accelerate. This already captures most of the benefits while reducing the computational cost of QR decomposition and eigendecomposition, thereby achieving speedup.

Estimating Upper Bound

Power iteration and its accelerated versions can strictly only provide a lower bound for the spectral norm, though this is often sufficient. However, in some scenarios we must have an upper bound—for instance, when we want to ensure the norm of a matrix is strictly less than 1 via spectral normalization—we need to find another approach.

A basic scheme in this case is to compute the Schatten norm, which provides an upper bound for the spectral norm. Let the SVD of matrix \boldsymbol{W} be \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}, with singular values \sigma_1\geq\sigma_2\geq\cdots\geq\sigma_m. Its Schatten-p norm is defined as \Vert\boldsymbol{W}\Vert_{S,p} = \sqrt[\uproot{10}p]{\sum_{i=1}^m \sigma_i^p} \quad\geq\quad \sigma_1 The larger p is, the closer it is to the true value. When p is even, the Schatten-p norm is relatively feasible to compute, mainly due to the following identities: \begin{gathered} \Vert\boldsymbol{W}\Vert_{S,2k} = \sqrt[\uproot{10}2k]{\sum_{i=1}^m \sigma_i^{2k}} = \sqrt[\uproot{3}2k]{\mathop{\text{tr}}(\boldsymbol{V}\boldsymbol{\Sigma}^{2k}\boldsymbol{V}^{\top})} = \sqrt[\uproot{3}2k]{\mathop{\text{tr}}((\boldsymbol{W}^{\top}\boldsymbol{W})^k)} \\ \Vert\boldsymbol{W}\Vert_{S,4k} = \sqrt[\uproot{10}4k]{\sum_{i=1}^m \sigma_i^{4k}} = \sqrt[\uproot{3}2k]{\Vert\boldsymbol{V}\boldsymbol{\Sigma}^{2k}\boldsymbol{V}^{\top}\Vert_{S,2}} = \sqrt[\uproot{3}2k]{\Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^k\Vert_{S,2}}\label{eq:S-4k} \end{gathered} It is easy to see that the Schatten-2 norm is actually the Frobenius norm, which can be computed by summing the squares of all elements and taking the square root. Therefore, the two identities above show that once we compute (\boldsymbol{W}^{\top}\boldsymbol{W})^k, we can obtain \Vert\boldsymbol{W}\Vert_{S,2k} and \Vert\boldsymbol{W}\Vert_{S,4k} at relatively low cost.

The first step \boldsymbol{W}^{\top}\boldsymbol{W} has complexity \mathcal{O}(nm^2), and subsequent repeated squarings have complexity \mathcal{O}(m^3) each. Thus, with a complexity of \mathcal{O}(nm^2 + Tm^3), we can compute (\boldsymbol{W}^{\top}\boldsymbol{W})^{2^T}. In theory, this is far higher than power iteration; in fact, when n,m are large, the very first step \boldsymbol{W}^{\top}\boldsymbol{W} already dwarfs power iteration. Therefore, if you only want to estimate the spectral norm without requiring an upper bound, you should still prioritize power iteration.

However, matrix multiplication can usually be fully parallelized, so when n,m are not too large or n\gg m, the Schatten-p norm will not be a computational bottleneck, and we can consider it. Or, when we absolutely must have a strict upper bound, this seems to be the most straightforward route.

Preserving Numerical Stability

To compute \Vert\boldsymbol{W}\Vert_{S,2^{T+2}}, a naive implementation would start with \boldsymbol{M} = \boldsymbol{W}^{\top}\boldsymbol{W}, repeatedly execute \boldsymbol{M}\leftarrow \boldsymbol{M}^2 for T times to obtain (\boldsymbol{W}^{\top}\boldsymbol{W})^{2^T}, and then plug it into equation [eq:S-4k]. However, because of the super-exponential nature, the values quickly explode to NaN or collapse to zero.

In general, \Vert\boldsymbol{W}\Vert_{S,2^{T+2}} itself does not numerically explode; the problem lies in explicitly computing (\boldsymbol{W}^{\top}\boldsymbol{W})^{2^T}. The solution is to re-normalize after each squaring: \boldsymbol{M}\leftarrow \boldsymbol{M}^2/\mathop{\text{tr}}(\boldsymbol{M}^2). This prevents numerical explosion and ensures the scaling remains tight enough to avoid collapsing to zero. Meanwhile, we accumulate the normalization factors from each step in the logarithmic domain for the final computation of \Vert\boldsymbol{W}\Vert_{S,2^{T+2}}.

The computational procedure is summarized as follows: \begin{array}{|l|} \hline \text{Compute }\Vert\boldsymbol{W}\Vert_{S,2^{T+2}}\text{ as an upper bound for the spectral norm} \\[4pt] \hline \begin{array}{ll} 1: & \text{Initialize }\log S = \log\mathop{\text{tr}}(\boldsymbol{W}^{\top}\boldsymbol{W}), \boldsymbol{M}=\frac{\boldsymbol{W}^{\top}\boldsymbol{W}}{ \mathop{\text{tr}}(\boldsymbol{W}^{\top}\boldsymbol{W})}\\[4pt] 2: & \textbf{For }t=1,2,\cdots,T\textbf{ do } \\[4pt] 3: & \qquad \log S\leftarrow 2\log S + \log \mathop{\text{tr}}(\boldsymbol{M}^2) \\[4pt] 4: & \qquad \boldsymbol{M} \leftarrow \frac{\boldsymbol{M}^2}{ \mathop{\text{tr}}(\boldsymbol{M}^2)} \\[4pt] 5: & \text{Output } \exp\left(\frac{\log S + \log\Vert\boldsymbol{M}\Vert_F}{2^{T+1}}\right) \end{array} \\[8pt] \hline \end{array}

A simple reference implementation:

@jax.jit
def tr(w):
    return w.trace(axis1=-1, axis2=-2)[..., None, None]

@jax.jit(static_argnums=(1,))
def spec_norm_v3(w, T=5):
    m = (m := w.mT @ w) / (s := tr(m))
    ms_step = lambda i, ms: ((m := ms[0] @ ms[0]) / (s := tr(m)), 2 * ms[1] + jnp.log(s))
    m, logs = lax.fori_loop(0, T, ms_step, (m, jnp.log(s)))
    logf = 0.5 * jnp.log((m**2).sum(axis=[-1, -2], keepdims=True))
    return jnp.exp((logs + logf) / 2**(T + 1))

Higher-Order Moments

In order to compute \Vert\boldsymbol{W}\Vert_{S,2^{T+2}}, we need to compute \boldsymbol{W}^{\top}\boldsymbol{W},\cdots,(\boldsymbol{W}^{\top}\boldsymbol{W})^{2^{T-1}},(\boldsymbol{W}^{\top}\boldsymbol{W})^{2^T} in some fashion. This means we can simultaneously obtain \Vert\boldsymbol{W}\Vert_{S,2},\cdots,\Vert\boldsymbol{W}\Vert_{S,2^{T+1}},\Vert\boldsymbol{W}\Vert_{S,2^{T+2}}, but in the end only \Vert\boldsymbol{W}\Vert_{S,2^{T+2}} is used, which again seems somewhat “wasteful”.

Is there a way, similar to the Krylov subspace method for power iteration, to use all these results to improve estimation accuracy? Indeed there is! “Fast Tight Spectral-Norm Bounds” provides an approach to refine the estimate via nonlinear programming. Let us first consider a simple case: suppose we have computed the second and fourth moments of the singular values: \sum_{i=1}^m \sigma_i^2 = S_2, \qquad \sum_{i=1}^m \sigma_i^4 = S_4 Then, according to the results from the previous two sections, S_4^{1/4} is closer to the spectral norm than S_2^{1/2}, so we return S_4^{1/4} as the upper bound, discarding S_2. But is S_2 really useless? Imagine if m=2, then we have exactly two equations and two unknowns; in theory we could solve for \sigma_1,\sigma_2! When m>2, an exact solution is impossible, but we can still narrow the range for \sigma_1. Specifically, we have \frac{S_2 - \sigma_1^2}{m-1} = \frac{1}{m-1}\sum_{i=2}^m \sigma_i^2 \leq \sqrt{\frac{1}{m-1}\sum_{i=2}^m \sigma_i^4} = \sqrt{\frac{S_4 - \sigma_1^4}{m-1}} i.e., (S_2 - \sigma_1^2)^2\leq (m-1)(S_4 - \sigma_1^4). This is essentially a quadratic inequality in one variable, which can easily be solved to give \sigma_1 \leq \sqrt{\frac{S_2+\sqrt{(m-1)(mS_4-S_2^2)}}{m}}\label{eq:s2-s4} This provides an upper bound for \sigma_1 expressed in terms of S_2 and S_4, which is a better estimate than S_4^{1/4}. “Fast Tight Spectral-Norm Bounds” also extends this to simultaneously using S_2,S_4,S_6,S_8, yielding tighter but more cumbersome expressions. Interested readers can refer to the original paper for details. Using even higher-order moments to improve the estimate is theoretically possible, but in practice it often requires solving complex nonlinear systems and is of limited practical value.

Limitations

In theory, the result [eq:s2-s4] can be generalized to using arbitrary S_{2k} and S_{4k} to obtain a more accurate upper bound: \sigma_1 \leq \sqrt[\uproot{10}2k]{\frac{S_{2k}+\sqrt{(m-1)(mS_{4k}-S_{2k}^2)}}{m}} However, when k is large, the practical significance of this result is very limited, because it requires explicitly computing S_{2k} and S_{4k}, which will explode or collapse when k is large—this is not realistic. A possible improvement is to factor out S_{4k}^{1/4k}: \sqrt[\uproot{10}2k]{\frac{S_{2k}+\sqrt{(m-1)(mS_{4k}-S_{2k}^2)}}{m}} = S_{4k}^{1/4k}\cdot\sqrt[\uproot{10}2k]{\frac{S_{2k}/S_{4k}^{1/2}+\sqrt{(m-1)(m-S_{2k}^2/S_{4k})}}{m}} Then let S_{2k}/S_{4k}^{1/2}=e^{\epsilon}, and perform an approximate expansion in the logarithmic domain: \sqrt[\uproot{10}2k]{\frac{S_{2k}/S_{4k}^{1/2}+\sqrt{(m-1)(m-S_{2k}^2/S_{4k})}}{m}} \approx 1 - \frac{\epsilon^2}{4k(m-1)} However, our goal is to obtain an upper bound for the spectral norm. How to guarantee the upper-bound property while performing an approximate expansion appears rather complex. On the other hand, if we have already computed up to a relatively large k, the accuracy of S_{4k}^{1/4k} itself is already quite high, and the benefit of using programming ideas to further improve precision is marginal.

Conclusion

This article has briefly summarized several approaches for estimating the spectral norm. In practical applications, if you only need to approximately monitor the spectral norm, power iteration and its subspace-accelerated version are usually sufficient; if a strict upper bound must be guaranteed, consider computing the Schatten norm. The “waste not” philosophy extended from both routes—Krylov subspace leveraging iteration history, and nonlinear programming leveraging higher-order moment information—also provides valuable inspiration for algorithm design.

For reprinting, please include the address of this article: https://kexue.fm/archives/11736

For more detailed reprinting guidelines, please refer to: Science Space FAQ