In two previous articles, "Newton-Schulz Iteration for the msign Operator (Part 1)" and "Newton-Schulz Iteration for the msign Operator (Part 2)", we discussed the numerical computation of the \mathop{\text{msign}} operator for matrices. In this article, we focus on the "Singular Value Clipping" operation, which has recently sparked heated discussions on @_arohan_’s Twitter. We also mentioned it previously in "Higher-order MuP: Simpler but Smarter Spectral Condition Scaling". Hereafter, we will refer to it as \mathop{\text{mclip}}.
Basic Concepts
For a scalar x, the \mathop{\text{clip}} operation is defined as: \begin{equation} \mathop{\text{clip}}(x) = \max(\min(x, 1), -1) = \left\{\begin{aligned} 1, & \quad x \geq 1 \\ x, & \quad x \in (-1, 1) \\ -1, & \quad x \leq -1 \end{aligned}\right. \end{equation} That is, values greater than 1 or less than -1 are truncated; otherwise, they remain unchanged. We define the \mathop{\text{mclip}} of a matrix \boldsymbol{M} \in \mathbb{R}^{n \times m} as: \begin{equation} \mathop{\text{mclip}}(\boldsymbol{M}) = \boldsymbol{U} \mathop{\text{clip}}(\boldsymbol{\Sigma}) \boldsymbol{V}^{\top} \end{equation} where \boldsymbol{M} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^{\top} is the SVD of matrix \boldsymbol{M}, \boldsymbol{U} \in \mathbb{R}^{n \times n} and \boldsymbol{V} \in \mathbb{R}^{m \times m} are orthogonal matrices, and \boldsymbol{\Sigma} \in \mathbb{R}^{n \times m} is the diagonal matrix of singular values. Applying \mathop{\text{clip}} to a diagonal matrix means applying \mathop{\text{clip}} to each of its diagonal elements individually. Note that since the matrix \boldsymbol{\Sigma} is diagonal and its elements (singular values) are always non-negative, we also have: \begin{equation} \mathop{\text{mclip}}(\boldsymbol{M}) = \boldsymbol{U} \min(\boldsymbol{\Sigma}, 1) \boldsymbol{V}^{\top} \end{equation}
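To make the definition concrete, here is a minimal numpy sketch that computes \mathop{\text{mclip}} directly from the SVD (the function name mclip_svd is just an illustrative choice):

import numpy as np

def mclip_svd(m):
    # Thin SVD: for m of shape (n, k) with r = min(n, k),
    # u is (n, r), s is (r,), vh is (r, k)
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    # Singular values are non-negative, so clip(sigma) reduces to min(sigma, 1)
    return (u * np.minimum(s, 1.0)) @ vh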
SVD is naturally the standard way to compute \mathop{\text{mclip}}, but it is not very efficient. Given our previous experience with \mathop{\text{msign}}, it is natural to look for a Newton-Schulz iteration to compute \mathop{\text{mclip}}, just as we did for \mathop{\text{msign}}. While this line of reasoning is sound, there is in fact a cleverer method that builds on \mathop{\text{msign}}.
Standing on the Shoulders of Giants
This clever idea comes from @leloykun. In his blog post "Numerically Stable Spectral Clipping Via Newton-Schulz Iteration", he proposed standing on the shoulders of \mathop{\text{msign}} by expressing \mathop{\text{mclip}} in terms of \mathop{\text{msign}}, thereby avoiding the need to find a separate Newton-Schulz iteration. He provided an ingenious solution in his blog, but I personally find it somewhat unintuitive and not particularly efficient. Below, I present my own approach.
My starting point is a scalar identity (found with the help of Kimi): \begin{equation*} \min(x, 1) = \frac{1}{2} [x + 1 - (x-1)\mathop{\text{sign}}(x-1)] \end{equation*} For simplicity, let’s first assume \boldsymbol{M} is a full-rank square matrix. Then: \begin{equation} \begin{aligned} 2\mathop{\text{mclip}}(\boldsymbol{M}) &= \boldsymbol{U} [2\min(\boldsymbol{\Sigma}, 1)] \boldsymbol{V}^{\top} \\[6pt] &= \boldsymbol{U} [\boldsymbol{\Sigma} + \boldsymbol{I} - (\boldsymbol{\Sigma} - \boldsymbol{I})\mathop{\text{sign}}(\boldsymbol{\Sigma} - \boldsymbol{I})] \boldsymbol{V}^{\top} \\[6pt] &= \boldsymbol{U} [\boldsymbol{\Sigma} + \boldsymbol{I} - (\boldsymbol{\Sigma} - \boldsymbol{I})\mathop{\text{msign}}(\boldsymbol{\Sigma} - \boldsymbol{I})] \boldsymbol{V}^{\top} \\[6pt] &= \boldsymbol{M} + \boldsymbol{U}\boldsymbol{V}^{\top} - \boldsymbol{U}(\boldsymbol{\Sigma} - \boldsymbol{I})\mathop{\text{msign}}(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{V}^{\top} \end{aligned} \label{eq:2-mclip-M} \end{equation} where the third equality uses the fact that for a diagonal matrix, applying \mathop{\text{sign}} elementwise coincides with \mathop{\text{msign}}. Note that: \begin{equation} \begin{aligned} &\boldsymbol{U}(\boldsymbol{\Sigma} - \boldsymbol{I})\mathop{\text{msign}}(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{V}^{\top} \\[6pt] &= \boldsymbol{U}(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{U}^{\top} \boldsymbol{U}\mathop{\text{msign}}(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{V}^{\top} \\[6pt] &= (\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{U}^{\top} - \boldsymbol{I}) \mathop{\text{msign}}(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \\[6pt] &= (\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top} (\boldsymbol{U}\boldsymbol{V}^{\top})^{\top} - \boldsymbol{I}) \mathop{\text{msign}}(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \\[6pt] &= (\boldsymbol{M} (\boldsymbol{U}\boldsymbol{V}^{\top})^{\top} - \boldsymbol{I}) \mathop{\text{msign}}(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \end{aligned} \end{equation} where the second equality uses the fact that for any orthogonal matrices \boldsymbol{P}, \boldsymbol{Q}, the identity \boldsymbol{P}\mathop{\text{msign}}(\boldsymbol{R})\boldsymbol{Q} = \mathop{\text{msign}}(\boldsymbol{P}\boldsymbol{R}\boldsymbol{Q}) holds. Substituting this back into Equation [eq:2-mclip-M], we get: \begin{equation} 2\mathop{\text{mclip}}(\boldsymbol{M}) = \boldsymbol{M} + \boldsymbol{U}\boldsymbol{V}^{\top} + (\boldsymbol{I} - \boldsymbol{M}(\boldsymbol{U}\boldsymbol{V}^{\top})^{\top}) \mathop{\text{msign}}(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \label{eq:mclip-M-core} \end{equation} If \boldsymbol{M} is a general matrix of rank r, then \boldsymbol{U}\boldsymbol{V}^{\top} is replaced by \boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top}. We can verify the equality by directly substituting \boldsymbol{M} = \boldsymbol{U}_{[:,:r]}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{V}_{[:,:r]}^{\top} into the above equation.
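The orthogonal-invariance identity used above is easy to check numerically. A small sketch (using the same SVD-based msign as in the reference implementation below, and random orthogonal \boldsymbol{P}, \boldsymbol{Q} drawn via QR decomposition):

import numpy as np

def msign(m):
    # Polar factor U V^T via the (thin) SVD
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    return u @ vh

r = np.random.randn(8, 8)
p, _ = np.linalg.qr(np.random.randn(8, 8))  # random orthogonal P
q, _ = np.linalg.qr(np.random.randn(8, 8))  # random orthogonal Q
# P msign(R) Q == msign(P R Q), up to floating-point error
print(np.abs(p @ msign(r) @ q - msign(p @ r @ q)).max())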
(Note: I would like to thank @YouJiacheng for the discussions in this section.)
Reference Implementation
We know that \boldsymbol{U}\boldsymbol{V}^{\top} = \mathop{\text{msign}}(\boldsymbol{M}), so computing \mathop{\text{mclip}} using Equation [eq:mclip-M-core] only requires computing \mathop{\text{msign}} twice: \begin{equation} 2\mathop{\text{mclip}}(\boldsymbol{M}) = \boldsymbol{M} + \mathop{\text{msign}}(\boldsymbol{M}) + (\boldsymbol{I} - \boldsymbol{M}\mathop{\text{msign}}(\boldsymbol{M})^{\top}) \mathop{\text{msign}}(\boldsymbol{M} - \mathop{\text{msign}}(\boldsymbol{M})) \end{equation}
The computational cost is roughly twice that of \mathop{\text{msign}}. In contrast, the method in "Numerically Stable Spectral Clipping Via Newton-Schulz Iteration" requires computing \mathop{\text{msign}} for a matrix approximately four times the size, which results in a computational cost roughly eight times that of \mathop{\text{msign}}.
Based on \mathop{\text{msign}}, implementing Equation [eq:mclip-M-core] takes as few as two lines of code. A reference implementation is provided below:
import numpy as np

def msign(m):
    # msign(M) = U V^T, computed here via the (thin) SVD
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    return u @ vh

def mclip(m):
    # Equation [eq:mclip-M-core]:
    # 2 mclip(M) = M + msign(M) + (I - M msign(M)^T) msign(M - msign(M))
    ms2 = msign(m - (ms := msign(m)))
    # .mT is the matrix transpose (NumPy >= 2.0); for 2-D arrays, .T is equivalent
    return (m + ms + ms2 - m @ ms.mT @ ms2) / 2

# Verify against the SVD-based definition on a random rectangular matrix
m = np.random.randn(10, 20)
u, s, vh = np.linalg.svd(m, full_matrices=False)
result1 = u @ np.diag(s.clip(0, 1)) @ vh
result2 = mclip(m)
print(np.abs(result1 - result2).mean())

Here, SVD is used directly to compute \mathop{\text{msign}} so as to quickly verify the correctness of Equation [eq:mclip-M-core]. In actual computations, readers can replace the msign function with the corresponding Newton-Schulz iteration.
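For completeness, here is a minimal sketch of such a replacement (a hypothetical msign_ns, using the classical cubic Newton-Schulz step; the tuned coefficient schedules discussed in the two articles cited at the start converge faster, so treat this as illustrative rather than a recommended schedule):

def msign_ns(m, steps=30):
    # Normalize so every singular value lies in (0, 1], inside the
    # convergence region (0, sqrt(3)) of the cubic iteration
    x = m / np.linalg.norm(m)  # Frobenius norm upper-bounds the spectral norm
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # drives each singular value toward 1
    return x

Swapping msign_ns in for msign inside mclip then gives an SVD-free implementation; note that very small singular values converge slowly under this plain iteration, which is one motivation for the tuned coefficients in the earlier articles.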
Other Functions
We can use the same logic to compute matrix versions of other functions, such as the step function. We define the scalar step function as: \begin{equation} \mathop{\text{step}}(x) = \frac{1}{2}[\mathop{\text{sign}}(x - 1) + 1] \end{equation} This means values greater than 1 become 1, and values less than 1 become 0. Thus, we can define: \begin{equation} \mathop{\text{mstep}}(\boldsymbol{M}) = \boldsymbol{U}\mathop{\text{step}}(\boldsymbol{\Sigma})\boldsymbol{V}^{\top} \end{equation} which means singular values greater than 1 are mapped to 1, while those less than 1 are set to zero. Following the same steps, we obtain: \begin{equation} \mathop{\text{mstep}}(\boldsymbol{M}) = \frac{1}{2}[\mathop{\text{msign}}(\boldsymbol{M}) + \mathop{\text{msign}}(\boldsymbol{M} - \mathop{\text{msign}}(\boldsymbol{M}))] \end{equation} We can even represent even functions. For example, define: \begin{equation} \mathop{\text{msquare}}(\boldsymbol{M}) = \boldsymbol{U} \boldsymbol{\Sigma}^2\boldsymbol{V}^{\top} = \boldsymbol{U}\boldsymbol{V}^{\top}(\boldsymbol{V}\boldsymbol{\Sigma}\boldsymbol{U}^{\top})(\boldsymbol{U} \boldsymbol{\Sigma}\boldsymbol{V}^{\top}) = \mathop{\text{msign}}(\boldsymbol{M})\boldsymbol{M}^{\top}\boldsymbol{M} \end{equation} This is different from the matrix square defined directly by \boldsymbol{M}^2; the latter is only valid for square matrices and squares the eigenvalues under eigenvalue decomposition, whereas the above squares the singular values under singular value decomposition. Generally, we have: \begin{equation} \boldsymbol{U} \boldsymbol{\Sigma}^{2n}\boldsymbol{V}^{\top} = \mathop{\text{msign}}(\boldsymbol{M})(\boldsymbol{M}^{\top}\boldsymbol{M})^n, \quad \boldsymbol{U} \boldsymbol{\Sigma}^{2n+1}\boldsymbol{V}^{\top} = \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^n \end{equation} This indicates that for any polynomial f(x) (not just odd polynomials), \boldsymbol{U}f(\boldsymbol{\Sigma})\boldsymbol{V}^{\top} can be obtained from \boldsymbol{M}, \mathop{\text{msign}}(\boldsymbol{M}), and a finite number of matrix additions and multiplications.
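A short numpy sketch (the names mstep and msquare are illustrative choices) that checks both formulas against their SVD definitions:

import numpy as np

def msign(m):
    # Polar factor U V^T via the (thin) SVD
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    return u @ vh

def mstep(m):
    # mstep(M) = [msign(M) + msign(M - msign(M))] / 2
    ms = msign(m)
    return (ms + msign(m - ms)) / 2

def msquare(m):
    # U Sigma^2 V^T = msign(M) M^T M
    return msign(m) @ m.T @ m

m = np.random.randn(10, 20)
u, s, vh = np.linalg.svd(m, full_matrices=False)
# Compare against applying step / square to the singular values directly
print(np.abs(u @ np.diag(np.where(s > 1, 1.0, 0.0)) @ vh - mstep(m)).max())
print(np.abs(u @ np.diag(s**2) @ vh - msquare(m)).max())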
Summary
This article introduced an approach to performing general operations on the singular values of a matrix using the matrix itself and its \mathop{\text{msign}}. This includes singular value clipping, step functions, and arbitrary-degree polynomials (not just odd ones).
Original URL: https://kexue.fm/archives/11006