English (unofficial) translations of posts at kexue.fm

Computing Singular Value Clipping (mclip) via msign (Part 1)

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

In two previous articles, "Newton-Schulz Iteration for the msign Operator (Part 1)" and "Newton-Schulz Iteration for the msign Operator (Part 2)", we discussed the numerical computation of the \mathop{\text{msign}} operator for matrices. In this article, we focus on the "Singular Value Clipping" operation, which has recently sparked heated discussions on @_arohan_’s Twitter. We also mentioned it previously in "Higher-order MuP: Simpler but Smarter Spectral Condition Scaling". Hereafter, we will refer to it as \mathop{\text{mclip}}.

Basic Concepts

For a scalar x, the \mathop{\text{clip}} operation is defined as: \begin{equation} \mathop{\text{clip}}(x) = \max(\min(x, 1), -1) = \left\{\begin{aligned} 1, & \quad x \geq 1 \\ x, & \quad x \in (-1, 1) \\ -1, & \quad x \leq -1 \end{aligned}\right. \end{equation} That is, values greater than 1 or less than -1 are truncated; otherwise, they remain unchanged. We define the \mathop{\text{mclip}} of a matrix \boldsymbol{M} \in \mathbb{R}^{n \times m} as: \begin{equation} \mathop{\text{mclip}}(\boldsymbol{M}) = \boldsymbol{U} \mathop{\text{clip}}(\boldsymbol{\Sigma}) \boldsymbol{V}^{\top} \end{equation} where \boldsymbol{M} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^{\top} is the SVD of matrix \boldsymbol{M}, \boldsymbol{U} \in \mathbb{R}^{n \times n} and \boldsymbol{V} \in \mathbb{R}^{m \times m} are orthogonal matrices, and \boldsymbol{\Sigma} \in \mathbb{R}^{n \times m} is the diagonal matrix of singular values. Applying \mathop{\text{clip}} to a diagonal matrix means applying \mathop{\text{clip}} to each of its diagonal elements individually. Note that since the matrix \boldsymbol{\Sigma} is diagonal and its elements (singular values) are always non-negative, we also have: \begin{equation} \mathop{\text{mclip}}(\boldsymbol{M}) = \boldsymbol{U} \min(\boldsymbol{\Sigma}, 1) \boldsymbol{V}^{\top} \end{equation}
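This definition can be computed directly from the SVD. As a quick illustration (a minimal sketch; the helper name `mclip_svd` is ours, chosen to avoid clashing with the reference implementation later in the post):

```python
import numpy as np

def mclip_svd(m):
    # Direct computation of mclip via SVD: clip the singular values at 1.
    # Singular values are non-negative, so clip reduces to min(s, 1).
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    return (u * np.minimum(s, 1.0)) @ vh

m = np.random.randn(5, 8)
out = mclip_svd(m)
# All singular values of the result are at most 1.
assert np.linalg.svd(out, compute_uv=False).max() <= 1 + 1e-9
# A matrix whose singular values are already <= 1 passes through unchanged.
assert np.allclose(mclip_svd(np.eye(3) * 0.5), np.eye(3) * 0.5)
```

Note that `u * np.minimum(s, 1.0)` scales the columns of `u`, which is equivalent to multiplying by the clipped diagonal matrix without materializing it.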

SVD is the natural, standard way to compute \mathop{\text{mclip}}, but it is not particularly efficient. Given our previous experience with \mathop{\text{msign}}, one might think of finding a Newton-Schulz iteration for \mathop{\text{mclip}}, just as we did for \mathop{\text{msign}}. While that line of reasoning is sound, we can actually find a cleverer method built on top of \mathop{\text{msign}}.

Standing on the Shoulders of Giants

This clever idea comes from @leloykun. In his blog post "Numerically Stable Spectral Clipping Via Newton-Schulz Iteration", he proposed standing on the shoulders of \mathop{\text{msign}} by expressing \mathop{\text{mclip}} in terms of \mathop{\text{msign}}, thereby avoiding the need to find a separate Newton-Schulz iteration. He provided an ingenious solution in his blog, but I personally find it somewhat unintuitive and not particularly efficient. Below, I present my own approach.

My starting point is a scalar identity (found with the help of Kimi): \begin{equation*} \min(x, 1) = \frac{1}{2} [x + 1 - (x-1)\mathop{\text{sign}}(x-1)] \end{equation*} For simplicity, let’s first assume \boldsymbol{M} is a full-rank square matrix. Then: \begin{equation} \begin{aligned} 2\mathop{\text{mclip}}(\boldsymbol{M}) &= \boldsymbol{U} [2\min(\boldsymbol{\Sigma}, 1)] \boldsymbol{V}^{\top} \\[6pt] &= \boldsymbol{U} [\boldsymbol{\Sigma} + \boldsymbol{I} - (\boldsymbol{\Sigma} - \boldsymbol{I})\mathop{\text{sign}}(\boldsymbol{\Sigma} - \boldsymbol{I})] \boldsymbol{V}^{\top} \\[6pt] &= \boldsymbol{U} [\boldsymbol{\Sigma} + \boldsymbol{I} - (\boldsymbol{\Sigma} - \boldsymbol{I})\mathop{\text{msign}}(\boldsymbol{\Sigma} - \boldsymbol{I})] \boldsymbol{V}^{\top} \\[6pt] &= \boldsymbol{M} + \boldsymbol{U}\boldsymbol{V}^{\top} - \boldsymbol{U}(\boldsymbol{\Sigma} - \boldsymbol{I})\mathop{\text{msign}}(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{V}^{\top} \end{aligned} \label{eq:2-mclip-M} \end{equation} Note that: \begin{equation} \begin{aligned} &\boldsymbol{U}(\boldsymbol{\Sigma} - \boldsymbol{I})\mathop{\text{msign}}(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{V}^{\top} \\[6pt] &= \boldsymbol{U}(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{U}^{\top} \boldsymbol{U}\mathop{\text{msign}}(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{V}^{\top} \\[6pt] &= (\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{U}^{\top} - \boldsymbol{I}) \mathop{\text{msign}}(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \\[6pt] &= (\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top} (\boldsymbol{U}\boldsymbol{V}^{\top})^{\top} - \boldsymbol{I}) \mathop{\text{msign}}(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \\[6pt] &= (\boldsymbol{M} (\boldsymbol{U}\boldsymbol{V}^{\top})^{\top} - \boldsymbol{I}) \mathop{\text{msign}}(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \\[6pt] \end{aligned} \end{equation} where the second equality uses the fact 
that for any orthogonal matrices \boldsymbol{P}, \boldsymbol{Q}, the identity \boldsymbol{P}\mathop{\text{msign}}(\boldsymbol{R})\boldsymbol{Q} = \mathop{\text{msign}}(\boldsymbol{P}\boldsymbol{R}\boldsymbol{Q}) holds. Substituting this back into Equation [eq:2-mclip-M], we get: \begin{equation} 2\mathop{\text{mclip}}(\boldsymbol{M}) = \boldsymbol{M} + \boldsymbol{U}\boldsymbol{V}^{\top} + (\boldsymbol{I} - \boldsymbol{M}(\boldsymbol{U}\boldsymbol{V}^{\top})^{\top}) \mathop{\text{msign}}(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \label{eq:mclip-M-core} \end{equation} If \boldsymbol{M} is a general matrix of rank r, then \boldsymbol{U}\boldsymbol{V}^{\top} is replaced by \boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top}. We can verify the equality by directly substituting \boldsymbol{M} = \boldsymbol{U}_{[:,:r]}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{V}_{[:,:r]}^{\top} into the above equation.
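The scalar identity that the whole derivation rests on is the familiar \min(a, b) = \tfrac{1}{2}(a + b - |a - b|) with b = 1, and it is easy to sanity-check numerically (a quick check of our own, not part of the original post):

```python
import numpy as np

# Check min(x, 1) == (x + 1 - (x - 1) * sign(x - 1)) / 2 on a grid.
# Note that at x = 1 both sides equal 1, since sign(0) = 0.
x = np.linspace(-3, 3, 601)
lhs = np.minimum(x, 1.0)
rhs = (x + 1 - (x - 1) * np.sign(x - 1)) / 2
assert np.allclose(lhs, rhs)
```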

(Note: I would like to thank @YouJiacheng for the discussions in this section.)

Reference Implementation

We know that \boldsymbol{U}\boldsymbol{V}^{\top} = \mathop{\text{msign}}(\boldsymbol{M}), so computing \mathop{\text{mclip}} using Equation [eq:mclip-M-core] only requires computing \mathop{\text{msign}} twice: \begin{equation} 2\mathop{\text{mclip}}(\boldsymbol{M}) = \boldsymbol{M} + \mathop{\text{msign}}(\boldsymbol{M}) + (\boldsymbol{I} - \boldsymbol{M}\mathop{\text{msign}}(\boldsymbol{M})^{\top}) \mathop{\text{msign}}(\boldsymbol{M} - \mathop{\text{msign}}(\boldsymbol{M})) \end{equation}

The computational cost is roughly twice that of \mathop{\text{msign}}. In contrast, the method in "Numerically Stable Spectral Clipping Via Newton-Schulz Iteration" requires computing \mathop{\text{msign}} for a matrix approximately four times the size, which results in a computational cost roughly eight times that of \mathop{\text{msign}}.

Based on \mathop{\text{msign}}, implementing Equation [eq:mclip-M-core] takes as few as two lines of code. A reference implementation is provided below:

import numpy as np

def msign(m):
    # msign(M) = U V^T, computed here directly via SVD for verification purposes.
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    return u @ vh

def mclip(m):
    # Equation [eq:mclip-M-core]:
    # 2 mclip(M) = M + msign(M) + (I - M msign(M)^T) msign(M - msign(M))
    # Note: ndarray.mT requires NumPy >= 2.0; for 2-D arrays, .T is equivalent.
    ms2 = msign(m - (ms := msign(m)))
    return (m + ms + ms2 - m @ ms.mT @ ms2) / 2

m = np.random.randn(10, 20)
u, s, vh = np.linalg.svd(m, full_matrices=False)

result1 = u @ np.diag(s.clip(0, 1)) @ vh  # ground truth: clip singular values via SVD
result2 = mclip(m)
print(np.abs(result1 - result2).mean())  # should be near machine precision

Here, SVD is used directly to compute \mathop{\text{msign}} to quickly verify the correctness of Equation [eq:mclip-M-core]. In actual computations, readers can replace the \mathop{\text{msign}} function with the corresponding Newton-Schulz iteration.
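As a sketch of that replacement, here is a simple Newton-Schulz \mathop{\text{msign}} using the classic cubic coefficients 3/2 and -1/2 (these are the textbook choice, not the tuned coefficients from the earlier posts; the helper name `msign_ns` is ours):

```python
import numpy as np

def msign_ns(m, steps=30):
    # Newton-Schulz iteration for msign with the classic cubic update
    # X <- 1.5 X - 0.5 X X^T X, which maps each singular value s to
    # 1.5 s - 0.5 s^3 and converges to 1 for s in (0, sqrt(3)).
    x = m / np.linalg.norm(m)  # Frobenius normalization puts all singular values in (0, 1]
    transpose = x.shape[0] > x.shape[1]
    if transpose:
        x = x.T  # iterate in the wide orientation so x @ x.T is the smaller Gram matrix
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transpose else x
```

For a full-rank matrix this converges to the same \boldsymbol{U}\boldsymbol{V}^{\top} that the SVD-based `msign` above produces; very small singular values need more iterations to be pushed up to 1, which is exactly what the tuned coefficients in the earlier posts accelerate.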

Other Functions

We can use the same logic to compute matrix versions of other functions, such as the step function. We define the scalar step function as: \begin{equation} \mathop{\text{step}}(x) = \frac{1}{2}[\mathop{\text{sign}}(x - 1) + 1] \end{equation} This maps values greater than 1 to 1 and values less than 1 to 0. Accordingly, we define: \begin{equation} \mathop{\text{mstep}}(\boldsymbol{M}) = \boldsymbol{U}\mathop{\text{step}}(\boldsymbol{\Sigma})\boldsymbol{V}^{\top} \end{equation} that is, singular values greater than 1 are truncated to 1, while singular values less than 1 are set to zero. Following the same steps as before, we obtain: \begin{equation} \mathop{\text{mstep}}(\boldsymbol{M}) = \frac{1}{2}[\mathop{\text{msign}}(\boldsymbol{M}) + \mathop{\text{msign}}(\boldsymbol{M} - \mathop{\text{msign}}(\boldsymbol{M}))] \end{equation}

We can even represent even functions. For example, define: \begin{equation} \mathop{\text{msquare}}(\boldsymbol{M}) = \boldsymbol{U} \boldsymbol{\Sigma}^2\boldsymbol{V}^{\top} = \boldsymbol{U}\boldsymbol{V}^{\top}(\boldsymbol{V}\boldsymbol{\Sigma}\boldsymbol{U}^{\top})(\boldsymbol{U} \boldsymbol{\Sigma}\boldsymbol{V}^{\top}) = \mathop{\text{msign}}(\boldsymbol{M})\boldsymbol{M}^{\top}\boldsymbol{M} \end{equation} This differs from the matrix square defined directly by \boldsymbol{M}^2: the latter is only valid for square matrices and squares the eigenvalues under an eigendecomposition, whereas the definition above squares the singular values under the SVD.
Generally, we have: \begin{equation} \boldsymbol{U} \boldsymbol{\Sigma}^{2n}\boldsymbol{V}^{\top} = \mathop{\text{msign}}(\boldsymbol{M})(\boldsymbol{M}^{\top}\boldsymbol{M})^n, \quad \boldsymbol{U} \boldsymbol{\Sigma}^{2n+1}\boldsymbol{V}^{\top} = \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^n \end{equation} This indicates that for any polynomial f(x) (not just odd polynomials), \boldsymbol{U}f(\boldsymbol{\Sigma})\boldsymbol{V}^{\top} can be obtained from \boldsymbol{M}, \mathop{\text{msign}}(\boldsymbol{M}), and a finite number of matrix additions and multiplications.
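Both the \mathop{\text{mstep}} and \mathop{\text{msquare}} identities can be verified numerically with the SVD-based `msign` reference from above (a quick check of our own; we construct a test matrix with known singular values, none equal to 1, so that \mathop{\text{sign}}(\sigma - 1) is well defined):

```python
import numpy as np

def msign(m):
    # SVD-based reference for msign: the orthogonal polar factor U V^T.
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    return u @ vh

# Build M = U diag(s) V^T with chosen singular values (none equal to 1).
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((6, 6)))
v, _ = np.linalg.qr(rng.standard_normal((6, 6)))
s = np.array([2.0, 1.5, 1.2, 0.8, 0.5, 0.3])
m = (u * s) @ v.T

# mstep: singular values > 1 map to 1, those < 1 map to 0.
mstep_ref = (u * (s > 1)) @ v.T
mstep = (msign(m) + msign(m - msign(m))) / 2
assert np.allclose(mstep, mstep_ref)

# msquare: squares the singular values while keeping U and V.
msquare_ref = (u * s**2) @ v.T
assert np.allclose(msign(m) @ m.T @ m, msquare_ref)
```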

Summary

This article introduced an approach to performing general operations on the singular values of a matrix using the matrix itself and its \mathop{\text{msign}}. This includes singular value clipping, step functions, and arbitrary-degree polynomials (not just odd ones).

Original URL: https://kexue.fm/archives/11006