English (unofficial) translations of posts at kexue.fm

The Derivative of SVD

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate; please refer to the original post for anything important.

SVD (Singular Value Decomposition) is a common matrix decomposition algorithm that many readers are likely already familiar with. We previously introduced it in "The Path to Low-Rank Approximation (II): SVD". However, have you ever considered that SVD can actually be differentiated? I was quite surprised when I first learned of this conclusion, as intuition suggests that "decomposition" is often non-differentiable. But the fact is, SVD is indeed differentiable under general conditions, which means that theoretically, we can embed SVD into a model and train it end-to-end using gradient-based optimizers.

The question is: since SVD is differentiable, what does its derivative look like? Next, we will follow the reference "Differentiating the Singular Value Decomposition" and derive the differentiation formulas for SVD step by step.

Foundations of Derivation

Assume \boldsymbol{W} is a full-rank n \times n matrix, and all singular values are distinct. This is a relatively easy case to discuss, and we will later address which conditions can be relaxed. We denote the SVD of \boldsymbol{W} as: \begin{equation} \boldsymbol{W} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top} \end{equation} The so-called SVD differentiation actually involves finding the gradients or differentials of \boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V} with respect to \boldsymbol{W}. To this end, we first take the differential of both sides of the above equation: \begin{equation} d\boldsymbol{W} = (d\boldsymbol{U})\boldsymbol{\Sigma}\boldsymbol{V}^{\top} + \boldsymbol{U}(d\boldsymbol{\Sigma})\boldsymbol{V}^{\top} + \boldsymbol{U}\boldsymbol{\Sigma}(d\boldsymbol{V})^{\top} \end{equation} Multiplying by \boldsymbol{U}^{\top} on the left and \boldsymbol{V} on the right, and using the identities \boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I} and \boldsymbol{V}^{\top}\boldsymbol{V} = \boldsymbol{I}, we obtain: \begin{equation} \boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V} = \boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma} + d\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V} \label{eq:core} \end{equation} This serves as the foundation for the subsequent derivation. Note that by taking the differential of the identities \boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I} and \boldsymbol{V}^{\top}\boldsymbol{V} = \boldsymbol{I}, we also get: \begin{equation} (d\boldsymbol{U})^{\top}\boldsymbol{U} + \boldsymbol{U}^{\top}(d\boldsymbol{U}) = \boldsymbol{0},\quad (d\boldsymbol{V})^{\top}\boldsymbol{V} + \boldsymbol{V}^{\top}(d\boldsymbol{V}) = \boldsymbol{0} \end{equation} This indicates that \boldsymbol{U}^{\top}(d\boldsymbol{U}) and (d\boldsymbol{V})^{\top}\boldsymbol{V} are both anti-symmetric matrices.
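
Before moving on, equation [eq:core] and the anti-symmetry of \boldsymbol{U}^{\top}(d\boldsymbol{U}) are easy to check numerically. Below is a minimal NumPy sketch of such a check, assuming a random \boldsymbol{W} (full rank with distinct singular values, almost surely) and a random direction d\boldsymbol{W}; the sign-alignment step is only needed because a numerical SVD is unique up to flipping the signs of \boldsymbol{u}_i and \boldsymbol{v}_i together.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 5, 1e-7
W = rng.standard_normal((n, n))
dW = rng.standard_normal((n, n))

U, s, Vh = np.linalg.svd(W)
U2, s2, Vh2 = np.linalg.svd(W + t * dW)

# align signs: a numerical SVD is only unique up to flipping (u_i, v_i) together
signs = np.sign(np.sum(U * U2, axis=0))
U2, Vh2 = U2 * signs, Vh2 * signs[:, None]

dU, dS, dV = (U2 - U) / t, np.diag((s2 - s) / t), (Vh2.T - Vh.T) / t
S, V = np.diag(s), Vh.T

lhs = U.T @ dW @ V
rhs = U.T @ dU @ S + dS + S @ dV.T @ V
print(np.max(np.abs(lhs - rhs)))            # both sides of the core identity agree, up to O(t)
print(np.max(np.abs(U.T @ dU + dU.T @ U)))  # U^T dU is anti-symmetric, up to O(t)
```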

Singular Values

A characteristic of anti-symmetric matrices is that their diagonal elements are all zero. Since \boldsymbol{\Sigma} is a diagonal matrix (where non-diagonal elements are zero), this suggests we should handle the diagonal and non-diagonal elements separately.

First, we define the matrices \boldsymbol{I} and \bar{\boldsymbol{I}}: \boldsymbol{I} is the identity matrix (diagonal elements are 1, non-diagonal elements are 0); \bar{\boldsymbol{I}} is the complementary matrix of the identity (diagonal elements are 0, non-diagonal elements are 1). Using \boldsymbol{I}, \bar{\boldsymbol{I}}, and the Hadamard product \otimes, we can extract the diagonal and non-diagonal parts of equation [eq:core] respectively: \begin{align} \boldsymbol{I}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}) =&\, \boldsymbol{I}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma} + d\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V}) = d\boldsymbol{\Sigma} \label{eq:value} \\[8pt] \bar{\boldsymbol{I}}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}) =&\, \bar{\boldsymbol{I}}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma} + d\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V}) = \boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V} \label{eq:vector} \end{align} Now let’s look at equation [eq:value], which can be equivalently written as: \begin{equation} d\sigma_i = \boldsymbol{u}_i^{\top}(d\boldsymbol{W})\boldsymbol{v}_i \label{eq:d-sigma} \end{equation} This is the differential of the i-th singular value \sigma_i, where \boldsymbol{u}_i, \boldsymbol{v}_i are the i-th columns of \boldsymbol{U} and \boldsymbol{V} respectively. The spectral norm gradient discussed in "Thoughts from Spectral Norm Gradient to New Weight Decay" is actually just a special case here when i=1.
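
Equation [eq:d-sigma] is easy to verify numerically: perturb \boldsymbol{W} along a random direction d\boldsymbol{W} and compare the finite-difference change of each singular value with \boldsymbol{u}_i^{\top}(d\boldsymbol{W})\boldsymbol{v}_i. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 6, 1e-6
W = rng.standard_normal((n, n))
dW = rng.standard_normal((n, n))

U, s, Vh = np.linalg.svd(W)
V = Vh.T
s2 = np.linalg.svd(W + t * dW, compute_uv=False)

fd = (s2 - s) / t                 # finite-difference d(sigma_i)/dt along dW
pred = np.diag(U.T @ dW @ V)      # u_i^T dW v_i for each i
print(np.max(np.abs(fd - pred)))  # should be small, roughly O(t)
```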

Singular Vectors

Next, we look at equation [eq:vector]: \begin{equation} \bar{\boldsymbol{I}}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}) = \boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V} \label{eq:vector-1} \end{equation} Taking the transpose: \begin{equation} \begin{aligned} \bar{\boldsymbol{I}}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}) =&\, \boldsymbol{\Sigma}(d\boldsymbol{U})^{\top}\boldsymbol{U} + \boldsymbol{V}^{\top}(d\boldsymbol{V})\boldsymbol{\Sigma} \\[6pt] =&\, -\boldsymbol{\Sigma}\boldsymbol{U}^{\top}(d\boldsymbol{U}) - (d\boldsymbol{V})^{\top}\boldsymbol{V}\boldsymbol{\Sigma} \end{aligned} \label{eq:vector-2} \end{equation} The second equality utilizes the fact that \boldsymbol{U}^{\top}(d\boldsymbol{U}) and (d\boldsymbol{V})^{\top}\boldsymbol{V} are anti-symmetric. Equations [eq:vector-1] and [eq:vector-2] form a system of linear equations for d\boldsymbol{U} and d\boldsymbol{V}, which we need to solve.

The solution strategy is standard elimination. First, from \eqref{eq:vector-1} \times \boldsymbol{\Sigma} + \boldsymbol{\Sigma} \times \eqref{eq:vector-2}, we get: \begin{equation} \bar{\boldsymbol{I}}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}) = \boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma}^2 - \boldsymbol{\Sigma}^2\boldsymbol{U}^{\top}(d\boldsymbol{U}) \end{equation} Here we use the fact that for a diagonal matrix \boldsymbol{\Sigma}, \boldsymbol{\Sigma}(\bar{\boldsymbol{I}}\otimes \boldsymbol{M}) = \bar{\boldsymbol{I}}\otimes (\boldsymbol{\Sigma}\boldsymbol{M}) and (\bar{\boldsymbol{I}}\otimes \boldsymbol{M})\boldsymbol{\Sigma} = \bar{\boldsymbol{I}}\otimes (\boldsymbol{M}\boldsymbol{\Sigma}). We know that left (right) multiplying by a diagonal matrix is equivalent to multiplying each row (column) of the matrix by the corresponding element of the diagonal matrix. Thus, if we define a matrix \boldsymbol{E} where \boldsymbol{E}_{i,j} = \sigma_j^2 - \sigma_i^2, then \boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma}^2 - \boldsymbol{\Sigma}^2\boldsymbol{U}^{\top}(d\boldsymbol{U}) = \boldsymbol{E}\otimes (\boldsymbol{U}^{\top}(d\boldsymbol{U})). The equation can then be written as: \begin{equation} \bar{\boldsymbol{I}}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}) = \boldsymbol{E}\otimes (\boldsymbol{U}^{\top}(d\boldsymbol{U})) \label{eq:dU-0} \end{equation} From which we can solve: \begin{equation} d\boldsymbol{U} = \boldsymbol{U}(\boldsymbol{F}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U})) \label{eq:dU} \end{equation} Similarly, from \boldsymbol{\Sigma} \times \eqref{eq:vector-1} + \eqref{eq:vector-2} \times \boldsymbol{\Sigma}, we solve for d\boldsymbol{V}: \begin{equation} d\boldsymbol{V} = \boldsymbol{V}(\boldsymbol{F}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V})) \label{eq:dV} \end{equation} Equations [eq:dU] and [eq:dV] are the differentials of the singular vectors, where: \begin{equation} \boldsymbol{F}_{i,j} = \left\{\begin{aligned} &\, 1/(\sigma_j^2 - \sigma_i^2), &\, i\neq j \\ &\, 0, &\, i = j \end{aligned}\right. \end{equation}
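
Equation [eq:dU] can be checked the same way. The NumPy sketch below builds \boldsymbol{F} explicitly and compares the predicted d\boldsymbol{U} with a finite difference of the SVD (column signs are aligned to the base SVD, since a numerical SVD is only unique up to simultaneous sign flips of \boldsymbol{u}_i and \boldsymbol{v}_i):

```python
import numpy as np

rng = np.random.default_rng(2)
n, t = 5, 1e-7
W = rng.standard_normal((n, n))
dW = rng.standard_normal((n, n))

U, s, Vh = np.linalg.svd(W)
V, S, I = Vh.T, np.diag(s), np.eye(n)

# F_ij = 1/(sigma_j^2 - sigma_i^2) off the diagonal, 0 on the diagonal
F = (1 - I) / (s[None, :]**2 - s[:, None]**2 + I)

dU_pred = U @ (F * (U.T @ dW @ V @ S + S @ V.T @ dW.T @ U))

# finite difference of U, with column signs aligned to the base SVD
U2, s2, Vh2 = np.linalg.svd(W + t * dW)
signs = np.sign(np.sum(U * U2, axis=0))
dU_fd = (U2 * signs - U) / t

print(np.max(np.abs(dU_pred - dU_fd)))   # should be small, roughly O(t)
```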

Gradient (I)

Now that we have the differentials, how do we find the gradients? This is actually a bit tricky—not technically, but in terms of representation. For example, if \boldsymbol{W} and \boldsymbol{U} are n \times n matrices, the full gradient of \boldsymbol{U} with respect to \boldsymbol{W} is a 4th-order tensor of size n \times n \times n \times n, and high-order tensors are unfamiliar to most people, including myself.

To bypass the trouble of high-order tensors, we have two options. First, from a programming perspective, we don’t actually need to derive the explicit form of the gradient. Instead, we can write an equivalent forward form based on the differential results and let the framework’s automatic differentiation handle it. For instance, from equation [eq:d-sigma], we can conclude that the gradient of \sigma_i is equal to the gradient of {\color{skyblue}\boldsymbol{u}_i}^{\top} \boldsymbol{W} {\color{skyblue}\boldsymbol{v}_i}: \begin{equation} \nabla_{\boldsymbol{W}} \sigma_i = \nabla_{\boldsymbol{W}} ({\color{skyblue}\boldsymbol{u}_i}^{\top} \boldsymbol{W} {\color{skyblue}\boldsymbol{v}_i}) \end{equation} Here, the symbol color {\color{skyblue}\blacksquare} represents the stop_gradient operator to prevent the formula from becoming too bloated. Since \sigma_i is also equal to \boldsymbol{u}_i^{\top}\boldsymbol{W}\boldsymbol{v}_i, we only need to replace every occurrence of \sigma_i in the code with {\color{skyblue}\boldsymbol{u}_i}^{\top} \boldsymbol{W} {\color{skyblue}\boldsymbol{v}_i} to automatically obtain the correct gradient. Generally, we have: \begin{equation} \nabla_{\boldsymbol{W}} \boldsymbol{\Sigma} = \nabla_{\boldsymbol{W}} (\boldsymbol{I}\otimes({\color{skyblue}\boldsymbol{U}}^{\top} \boldsymbol{W} {\color{skyblue}\boldsymbol{V}})) \end{equation} That is, replace all \boldsymbol{\Sigma} with \boldsymbol{I}\otimes({\color{skyblue}\boldsymbol{U}}^{\top} \boldsymbol{W} {\color{skyblue}\boldsymbol{V}}).
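
As a minimal PyTorch sketch of this replacement (with .detach() standing in for stop_gradient), the following recovers \nabla_{\boldsymbol{W}}\sigma_1 = \boldsymbol{u}_1\boldsymbol{v}_1^{\top} purely from automatic differentiation of the surrogate expression:

```python
import torch

torch.manual_seed(0)
n = 5
W = torch.randn(n, n, dtype=torch.float64, requires_grad=True)

U, s, Vh = torch.linalg.svd(W, full_matrices=False)
U, Vh = U.detach(), Vh.detach()      # stop_gradient: treat the singular vectors as constants

# I ⊙ (sg(U)^T W sg(V)); its forward value equals torch.diag(s)
Sigma = torch.eye(n, dtype=W.dtype) * (U.T @ W @ Vh.T)

Sigma[0, 0].backward()               # sigma_1, i.e. the spectral norm
print(torch.allclose(W.grad, torch.outer(U[:, 0], Vh[0])))   # expect grad = u_1 v_1^T
```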

Similarly, from equation [eq:dU], we know: \begin{equation} \nabla_{\boldsymbol{W}}\boldsymbol{U} = \nabla_{\boldsymbol{W}}({\color{skyblue}\boldsymbol{U}}({\color{skyblue}\boldsymbol{F}}\otimes({\color{skyblue}\boldsymbol{U}}^{\top}\boldsymbol{W}{\color{skyblue}\boldsymbol{V}\boldsymbol{\Sigma}} + {\color{skyblue}\boldsymbol{\Sigma}\boldsymbol{V}}^{\top}\boldsymbol{W}^{\top}{\color{skyblue}\boldsymbol{U}}))) \end{equation} It can be verified that \boldsymbol{U}(\boldsymbol{F}\otimes(\boldsymbol{U}^{\top}\boldsymbol{W}\boldsymbol{V}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}\boldsymbol{W}^{\top}\boldsymbol{U})) is exactly a zero matrix. Therefore, we only need to replace every occurrence of \boldsymbol{U} in the code with: \begin{equation} \boldsymbol{U} \quad \to \quad {\color{skyblue}\boldsymbol{U}} + {\color{skyblue}\boldsymbol{U}}({\color{skyblue}\boldsymbol{F}}\otimes({\color{skyblue}\boldsymbol{U}}^{\top}\boldsymbol{W}{\color{skyblue}\boldsymbol{V}\boldsymbol{\Sigma}} + {\color{skyblue}\boldsymbol{\Sigma}\boldsymbol{V}}^{\top}\boldsymbol{W}^{\top}{\color{skyblue}\boldsymbol{U}})) \end{equation} This maintains the correct forward result while obtaining the correct gradient. Based on the same principle, the replacement for \boldsymbol{V} is: \begin{equation} \boldsymbol{V} \quad \to \quad {\color{skyblue}\boldsymbol{V}} + {\color{skyblue}\boldsymbol{V}}({\color{skyblue}\boldsymbol{F}}\otimes({\color{skyblue}\boldsymbol{V}}^{\top}\boldsymbol{W}^{\top}{\color{skyblue}\boldsymbol{U}\boldsymbol{\Sigma}} + {\color{skyblue}\boldsymbol{\Sigma}\boldsymbol{U}}^{\top}\boldsymbol{W}{\color{skyblue}\boldsymbol{V}})) \end{equation}
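
The replacements for \boldsymbol{U} and \boldsymbol{V} can be rendered the same way. In the PyTorch sketch below (again with .detach() as stop_gradient), the added terms are exactly zero in the forward pass, and the resulting gradient of a toy linear loss is compared against PyTorch's own differentiable SVD; the two are expected to agree.

```python
import torch

torch.manual_seed(1)
n = 5
W = torch.randn(n, n, dtype=torch.float64, requires_grad=True)
A = torch.randn(n, n, dtype=torch.float64)    # coefficients of a toy linear loss <A, U> + <A, V>

# reference: PyTorch's own differentiable SVD
U_ref, s_ref, Vh_ref = torch.linalg.svd(W, full_matrices=False)
((U_ref * A).sum() + (Vh_ref.T * A).sum()).backward()
grad_ref, W.grad = W.grad.clone(), None

# surrogate: stop-gradient everything except the explicitly written W's
with torch.no_grad():
    U, s, Vh = torch.linalg.svd(W, full_matrices=False)
V, S, I = Vh.T, torch.diag(s), torch.eye(n, dtype=W.dtype)
F = (1 - I) / (s[None, :]**2 - s[:, None]**2 + I)   # F_ij = 1/(sigma_j^2 - sigma_i^2), F_ii = 0

# the added terms are exactly zero in the forward pass but carry the gradients of U and V
U_sur = U + U @ (F * (U.T @ W @ (V @ S) + (S @ V.T) @ W.T @ U))
V_sur = V + V @ (F * (V.T @ W.T @ (U @ S) + (S @ U.T) @ W @ V))

((U_sur * A).sum() + (V_sur * A).sum()).backward()
print(torch.allclose(W.grad, grad_ref))   # expected to match, if the signs above are right
```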

Gradient (II)

The second option is to directly solve for the gradient of the loss function with respect to \boldsymbol{W}. Specifically, assume the loss function is a function of \boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V}, denoted as \mathcal{L}(\boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V}). We directly solve for \nabla_{\boldsymbol{W}}\mathcal{L}, which is a matrix that can be expressed using \boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V}, \nabla_{\boldsymbol{U}}\mathcal{L}, \nabla_{\boldsymbol{\Sigma}}\mathcal{L}, \nabla_{\boldsymbol{V}}\mathcal{L}. These quantities are all just matrices, so no high-order tensors are involved.

In the previous section, we derived equivalent functions for \boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V} that have the same gradients. Excluding the parts under stop_gradient, these equivalent functions are linear with respect to \boldsymbol{W}. Thus, the problem essentially becomes finding the gradient of a linear composite function. We previously discussed related methods in the "Matrix Differentiation" section of "The Path to Low-Rank Approximation (I): Pseudo-inverse". Specifically, we have: \begin{align} \boldsymbol{X} = \boldsymbol{A}\boldsymbol{B}\boldsymbol{C} &\,\quad\Rightarrow\quad \nabla_{\boldsymbol{B}}f(\boldsymbol{X}) = \boldsymbol{A}^{\top}(\nabla_{\boldsymbol{X}}f(\boldsymbol{X}))\boldsymbol{C}^{\top} \\[8pt] \boldsymbol{X} = \boldsymbol{A}\boldsymbol{B}^{\top}\boldsymbol{C} &\,\quad\Rightarrow\quad \nabla_{\boldsymbol{B}}f(\boldsymbol{X}) = \boldsymbol{C}(\nabla_{\boldsymbol{X}}f(\boldsymbol{X}))^{\top}\boldsymbol{A} \\[8pt] \boldsymbol{X} = \boldsymbol{A}\otimes\boldsymbol{B} &\,\quad\Rightarrow\quad \nabla_{\boldsymbol{B}}f(\boldsymbol{X}) = \boldsymbol{A}\otimes \nabla_{\boldsymbol{X}}f(\boldsymbol{X}) \end{align} Using these basic formulas and the chain rule for composite functions, we can write: \begin{equation} \begin{aligned} \nabla_{\boldsymbol{W}}\mathcal{L} \quad = \qquad &\,\boldsymbol{U}(\boldsymbol{F}\otimes(\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{U}}\mathcal{L}) - (\nabla_{\boldsymbol{U}}\mathcal{L})^{\top}\boldsymbol{U}))\boldsymbol{\Sigma}\boldsymbol{V}^{\top} \\[6pt] + &\,\boldsymbol{U}(\boldsymbol{I}\otimes(\nabla_{\boldsymbol{\Sigma}}\mathcal{L}))\boldsymbol{V}^{\top} \\[6pt] + &\,\boldsymbol{U}\boldsymbol{\Sigma}(\boldsymbol{F}\otimes(\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{V}}\mathcal{L}) - (\nabla_{\boldsymbol{V}}\mathcal{L})^{\top}\boldsymbol{V}))\boldsymbol{V}^{\top} \end{aligned} \end{equation} The entire process involves repeatedly applying basic formulas and the chain rule, as well as the property \boldsymbol{F}^{\top} = -\boldsymbol{F}. In principle, it is not difficult, but it requires careful concentration. I suggest readers complete this derivation themselves; it is a very practical exercise in matrix differentiation. Finally, we introduce two notations: \begin{equation} {\color{red}[}\boldsymbol{X}{\color{red}]_{sym}} = \frac{1}{2}(\boldsymbol{X} + \boldsymbol{X}^{\top}),\qquad {\color{red}[}\boldsymbol{X}{\color{red}]_{skew}} = \frac{1}{2}(\boldsymbol{X} - \boldsymbol{X}^{\top}) \end{equation} We can simplify the gradient result to: \begin{equation} \nabla_{\boldsymbol{W}}\mathcal{L} = \boldsymbol{U}\Big(2(\boldsymbol{F}\otimes{\color{red}[}\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{U}}\mathcal{L}){\color{red}]_{skew}})\boldsymbol{\Sigma} + \boldsymbol{I}\otimes(\nabla_{\boldsymbol{\Sigma}}\mathcal{L}) + 2\boldsymbol{\Sigma}(\boldsymbol{F}\otimes{\color{red}[}\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{V}}\mathcal{L}){\color{red}]_{skew}})\Big)\boldsymbol{V}^{\top} \label{eq:w-grad-l} \end{equation}
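
Equation [eq:w-grad-l] can be implemented directly as a vector-Jacobian product. In the PyTorch sketch below, the helper name svd_vjp is just for illustration; it is checked in two ways: for the nuclear norm \sum_i \sigma_i the gradient should be \boldsymbol{U}\boldsymbol{V}^{\top}, and for a random linear loss it should agree with PyTorch's built-in SVD backward.

```python
import torch

def svd_vjp(U, s, V, gU, gs, gV):
    """Gradient w.r.t. W from (grad_U, grad_sigma, grad_V) via eq. [eq:w-grad-l]; square W, distinct sigma."""
    n = s.numel()
    I = torch.eye(n, dtype=s.dtype)
    F = (1 - I) / (s[None, :]**2 - s[:, None]**2 + I)   # F_ij = 1/(sigma_j^2 - sigma_i^2)
    S = torch.diag(s)
    skew = lambda X: 0.5 * (X - X.T)
    inner = (2 * (F * skew(U.T @ gU)) @ S       # singular-vector term for U
             + torch.diag(gs)                   # I ⊙ grad_Sigma
             + 2 * S @ (F * skew(V.T @ gV)))    # singular-vector term for V
    return U @ inner @ V.T

torch.manual_seed(2)
n = 5
W = torch.randn(n, n, dtype=torch.float64, requires_grad=True)
U, s, Vh = torch.linalg.svd(W, full_matrices=False)
V = Vh.T

# check 1: nuclear norm sum_i sigma_i, whose gradient should be U V^T
zeros = torch.zeros(n, n, dtype=W.dtype)
g1 = svd_vjp(U.detach(), s.detach(), V.detach(), zeros, torch.ones(n, dtype=W.dtype), zeros)
print(torch.allclose(g1, (U @ Vh).detach()))

# check 2: a random linear loss, compared against PyTorch's built-in SVD backward
gU = torch.randn(n, n, dtype=W.dtype)
gs = torch.randn(n, dtype=W.dtype)
gV = torch.randn(n, n, dtype=W.dtype)
((U * gU).sum() + (s * gs).sum() + (V * gV).sum()).backward()
g2 = svd_vjp(U.detach(), s.detach(), V.detach(), gU, gs, gV)
print(torch.allclose(g2, W.grad))   # expected to agree up to floating-point error
```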

Gradient (III)

Now let’s try a small example: finding the gradient of \boldsymbol{O}=\operatorname{msign}(\boldsymbol{W})=\boldsymbol{U}\boldsymbol{V}^{\top}. The concept of \operatorname{msign} was discussed in "Appreciating the Muon Optimizer: The Essential Leap from Vectors to Matrices" when introducing Muon.

According to the definition of \operatorname{msign}, we have: \begin{equation} \nabla_{\boldsymbol{U}}\mathcal{L} = (\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V},\qquad \nabla_{\boldsymbol{V}}\mathcal{L} = (\nabla_{\boldsymbol{O}}\mathcal{L})^{\top} \boldsymbol{U} \end{equation} Substituting into equation [eq:w-grad-l], we get: \begin{equation} \begin{aligned} \nabla_{\boldsymbol{W}}\mathcal{L} =&\, 2\boldsymbol{U}\Big((\boldsymbol{F}\otimes{\color{red}[}\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}{\color{red}]_{skew}})\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(\boldsymbol{F}\otimes{\color{red}[}\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})^{\top}\boldsymbol{U}{\color{red}]_{skew}})\Big)\boldsymbol{V}^{\top} \\[6pt] =&\, 2\boldsymbol{U}\Big((\boldsymbol{F}\otimes{\color{red}[}\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}{\color{red}]_{skew}})\boldsymbol{\Sigma} - \boldsymbol{\Sigma}(\boldsymbol{F}\otimes{\color{red}[}\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}{\color{red}]_{skew}})\Big)\boldsymbol{V}^{\top} \\[6pt] =&\, 2\boldsymbol{U}\big(\boldsymbol{G}\otimes{\color{red}[}\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}{\color{red}]_{skew}}\big)\boldsymbol{V}^{\top} \end{aligned} \end{equation} where \boldsymbol{G}_{i,j} = 1/(\sigma_i + \sigma_j).
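
This result is also easy to check numerically: for a linear loss \mathcal{L} = \langle\boldsymbol{A}, \boldsymbol{O}\rangle we have \nabla_{\boldsymbol{O}}\mathcal{L} = \boldsymbol{A}, and the formula should agree with autograd applied to \boldsymbol{O} = \boldsymbol{U}\boldsymbol{V}^{\top} computed from PyTorch's differentiable SVD. A minimal sketch:

```python
import torch

torch.manual_seed(3)
n = 5
W = torch.randn(n, n, dtype=torch.float64, requires_grad=True)
A = torch.randn(n, n, dtype=torch.float64)   # plays the role of grad_O L for the loss <A, O>

U, s, Vh = torch.linalg.svd(W, full_matrices=False)
(A * (U @ Vh)).sum().backward()              # reference: autograd through O = U V^T

U, V, s = U.detach(), Vh.T.detach(), s.detach()
G = 1.0 / (s[:, None] + s[None, :])          # G_ij = 1/(sigma_i + sigma_j)
skew = lambda X: 0.5 * (X - X.T)
grad_pred = 2 * U @ (G * skew(U.T @ A @ V)) @ V.T

print(torch.allclose(grad_pred, W.grad))     # expected to match
```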

Gradient (IV)

Finally, let’s consider a common special case: when \boldsymbol{W} is a positive definite symmetric matrix, its SVD takes the form \boldsymbol{V}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}, i.e., \boldsymbol{U}=\boldsymbol{V}. Repeating the previous derivation, we first take the differential of both sides: \begin{equation} d\boldsymbol{W} = (d\boldsymbol{V})\boldsymbol{\Sigma}\boldsymbol{V}^{\top} + \boldsymbol{V}(d\boldsymbol{\Sigma})\boldsymbol{V}^{\top} + \boldsymbol{V}\boldsymbol{\Sigma}(d\boldsymbol{V})^{\top} \end{equation} Then multiply by \boldsymbol{V}^{\top} on the left and \boldsymbol{V} on the right: \begin{equation} \begin{aligned} \boldsymbol{V}^{\top}(d\boldsymbol{W})\boldsymbol{V} =&\, \boldsymbol{V}^{\top}(d\boldsymbol{V})\boldsymbol{\Sigma} + d\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V} \\[6pt] =&\, \boldsymbol{V}^{\top}(d\boldsymbol{V})\boldsymbol{\Sigma} + d\boldsymbol{\Sigma} - \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{V}) \end{aligned} \end{equation} From this, it can be seen that: \begin{gather} d\boldsymbol{\Sigma} = \boldsymbol{I}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})\boldsymbol{V}) \\[8pt] \boldsymbol{V}^{\top}(d\boldsymbol{V})\boldsymbol{\Sigma} - \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{V}) = \bar{\boldsymbol{I}}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})\boldsymbol{V}) \end{gather} From the second equation, we can further solve: \begin{equation} d\boldsymbol{V} = \boldsymbol{V}(\boldsymbol{K}^{\top}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})\boldsymbol{V})) \end{equation} where: \begin{equation} \boldsymbol{K}_{i,j} = \left\{\begin{aligned} &\, 1/(\sigma_i - \sigma_j), &\, i\neq j \\ &\, 0, &\, i = j \end{aligned}\right. \end{equation} Based on this result, we have: \begin{equation} \nabla_{\boldsymbol{W}}\mathcal{L} = \boldsymbol{V}(\boldsymbol{K}^{\top}\otimes(\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{V}}\mathcal{L})) + \boldsymbol{I}\otimes(\nabla_{\boldsymbol{\Sigma}}\mathcal{L}))\boldsymbol{V}^{\top} \end{equation} Watch out for the trap! The above equation is actually incorrect. It is the result of the chain rule, but the chain rule only applies to unconstrained differentiation. Here, there is a constraint \boldsymbol{W}=\boldsymbol{W}^{\top}, so the correct gradient should also include symmetrization: \begin{equation} \begin{aligned} \nabla_{\boldsymbol{W}}\mathcal{L} =&\, \boldsymbol{V}\Big({\color{red}[}\boldsymbol{K}^{\top}\otimes(\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{V}}\mathcal{L})){\color{red}]_{sym}} + \boldsymbol{I}\otimes(\nabla_{\boldsymbol{\Sigma}}\mathcal{L})\Big)\boldsymbol{V}^{\top} \\[6pt] =&\, \boldsymbol{V}\Big(\boldsymbol{K}^{\top}\otimes{\color{red}[}\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{V}}\mathcal{L}){\color{red}]_{skew}} + \boldsymbol{I}\otimes(\nabla_{\boldsymbol{\Sigma}}\mathcal{L})\Big)\boldsymbol{V}^{\top} \end{aligned} \end{equation} Another trap is directly substituting \boldsymbol{U}=\boldsymbol{V} into equation [eq:w-grad-l] for derivation. This will lead to the \boldsymbol{K} term being doubled. The reason is that equation [eq:w-grad-l] distinguishes between \nabla_{\boldsymbol{U}}\mathcal{L} and \nabla_{\boldsymbol{V}}\mathcal{L}, whereas under the positive definite symmetric assumption, \boldsymbol{U} and \boldsymbol{V} are the same. Taking the gradient with respect to \boldsymbol{V} effectively sums the original \nabla_{\boldsymbol{U}}\mathcal{L} and \nabla_{\boldsymbol{V}}\mathcal{L}, leading to double counting.
For related literature, see "Matrix Backpropagation for Deep Networks With Structured Layers".
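
The differential d\boldsymbol{V} = \boldsymbol{V}(\boldsymbol{K}^{\top}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})\boldsymbol{V})) can be checked numerically as well. The NumPy sketch below draws a random positive definite symmetric \boldsymbol{W}, perturbs it along a symmetric direction, and compares against a finite difference of the eigenvectors (for such a \boldsymbol{W} the SVD coincides with the eigendecomposition, and the ordering of the eigenvalues does not matter for the formula):

```python
import numpy as np

rng = np.random.default_rng(4)
n, t = 5, 1e-7
B = rng.standard_normal((n, n))
W = B @ B.T + n * np.eye(n)          # positive definite symmetric, distinct eigenvalues (a.s.)
H = rng.standard_normal((n, n))
H = 0.5 * (H + H.T)                  # symmetric perturbation direction, dW = H

w, V = np.linalg.eigh(W)             # eigendecomposition = SVD here; Sigma = diag(w)
I = np.eye(n)
Kt = (1 - I) / (w[None, :] - w[:, None] + I)   # (K^T)_ij = 1/(sigma_j - sigma_i), zero diagonal

dV_pred = V @ (Kt * (V.T @ H @ V))

w2, V2 = np.linalg.eigh(W + t * H)
signs = np.sign(np.sum(V * V2, axis=0))        # align eigenvector signs to the base
dV_fd = (V2 * signs - V) / t

print(np.max(np.abs(dV_pred - dV_fd)))   # should be small, roughly O(t)
```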

Numerical Issues

Some readers might wonder: I can only guarantee that the singular values of the initialization matrix are distinct; how can I ensure the matrix still satisfies this condition after training? The answer is that the gradient naturally helps us ensure this. From equations [eq:dU] and [eq:dV], we can see that the gradient contains \boldsymbol{F}, and since \boldsymbol{F}_{i,j} = \frac{1}{\sigma_j^2 - \sigma_i^2}, once two singular values become close, the gradient will become very large, so the optimizer will automatically push them apart.

However, this property also brings numerical instability to actual training, mainly manifested as gradient explosion when \sigma_i, \sigma_j are close. To address this, the paper "Backpropagation-Friendly Eigendecomposition" proposes using "Power Iteration" instead of exact SVD. Later, the paper "Robust Differentiable SVD" proved that this is theoretically equivalent to a Taylor approximation of \frac{1}{\sigma_i - \sigma_j} in \boldsymbol{F}_{i,j} = -\frac{1}{(\sigma_i + \sigma_j)(\sigma_i - \sigma_j)} (assuming \sigma_j < \sigma_i): \begin{equation} \frac{1}{\sigma_i - \sigma_j} = \frac{1}{\sigma_i}\frac{1}{1-(\sigma_j/\sigma_i)}\approx \frac{1}{\sigma_i}\left(1 + \left(\frac{\sigma_j}{\sigma_i}\right) + \left(\frac{\sigma_j}{\sigma_i}\right)^2 + \cdots + \left(\frac{\sigma_j}{\sigma_i}\right)^N \right) \end{equation} Using the Taylor approximation prevents gradient explosion when \sigma_j \to \sigma_i. Later, the authors in "Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling?" generalized this to the more general Padé approximation. I am not very familiar with this series of work, so I won’t expand on it further.
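
To get a feel for the effect of the truncation, here is a tiny numerical sketch: with \sigma_i = 1 and \sigma_j = 0.95, the exact factor 1/(\sigma_i - \sigma_j) is 20, while the degree-9 truncation is only about 8 and stays bounded (by (N+1)/\sigma_i) as \sigma_j \to \sigma_i.

```python
def inv_gap_taylor(s_i, s_j, N=9):
    """Degree-N truncation of 1/(s_i - s_j) = (1/s_i) * 1/(1 - s_j/s_i), assuming s_j < s_i."""
    r = s_j / s_i
    return (1.0 / s_i) * sum(r**k for k in range(N + 1))

s_i, s_j = 1.0, 0.95
print(1.0 / (s_i - s_j))            # exact factor: 20.0
print(inv_gap_taylor(s_i, s_j))     # truncated factor: about 8.0; bounded by (N+1)/s_i as s_j -> s_i
```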

I just have one question: if the goal is simply to avoid numerical explosion, is it necessary to use these complex tools? Couldn’t we just add a truncation to \sigma_j/\sigma_i? For example: \begin{equation} \frac{1}{1-(\sigma_j/\sigma_i)}\approx \frac{1}{1-\min(\sigma_j/\sigma_i, 0.99)} \end{equation} Wouldn’t this simply prevent it from tending to infinity? Or is there something I’m missing? If readers are familiar with the relevant background, please correct me in the comments. Thank you.
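
Written out literally, the truncation suggested above is just as simple (a toy sketch with 0.99 as the cap):

```python
def inv_gap_clipped(s_i, s_j, cap=0.99):
    """1/(s_i - s_j) with the ratio s_j/s_i truncated at `cap` (assuming s_j < s_i)."""
    return (1.0 / s_i) / (1.0 - min(s_j / s_i, cap))

s_i = 1.0
for s_j in (0.5, 0.95, 0.999999):
    print(s_j, 1.0 / (s_i - s_j), inv_gap_clipped(s_i, s_j))
# the exact factor blows up as s_j -> s_i, while the clipped one saturates at 1/(s_i*(1 - cap)) = 100
```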

General Results

So far, all matrices appearing above have been n \times n matrices, because the derivation was conducted under the initial assumption that "\boldsymbol{W} is a full-rank n \times n matrix, and all singular values are distinct." In this section, we discuss to what extent these conditions can be relaxed.

Briefly, the condition for SVD to be differentiable is that "all non-zero singular values are distinct." In other words, the matrix doesn’t have to be square or full-rank, but the non-zero singular values must still be unique. This is because once there are identical singular values, the SVD is no longer unique, which fundamentally breaks differentiability. Of course, if we only need partial derivatives, we can relax the conditions further. For example, if we only want the derivative of the spectral norm \sigma_1, we only need \sigma_1 > \sigma_2.

So, how do the differential results change after relaxing the conditions? Let’s assume generally that \boldsymbol{W} \in \mathbb{R}^{n \times m} with rank r, and its SVD is: \begin{equation} \boldsymbol{W} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top},\quad\boldsymbol{U}\in\mathbb{R}^{n\times n},\boldsymbol{\Sigma}\in\mathbb{R}^{n\times m},\boldsymbol{V}\in\mathbb{R}^{m\times m} \end{equation} The reason zero singular values are allowed is that we only care about d\boldsymbol{U}_{[:, :r]} and d\boldsymbol{V}_{[:, :r]}. Under the assumption that non-zero singular values are distinct, they can be uniquely determined. The derivation up to equation [eq:dU-0] is unchanged in this general setting, so we can start from there. Equation [eq:dU-0] also shows why identical singular values are rejected: when \sigma_i = \sigma_j, the factor \sigma_j^2 - \sigma_i^2 vanishes and cannot be inverted. Keeping only the parts related to d\boldsymbol{U}_{[:, :r]} in equation [eq:dU-0], we get: \begin{equation} \bar{\boldsymbol{I}}_{[:, :r]}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}\boldsymbol{\Sigma}_{[:, :r]} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]}) = \boldsymbol{E}_{[:, :r]}\otimes (\boldsymbol{U}^{\top}(d\boldsymbol{U}_{[:, :r]})) \end{equation} Here the definition of \boldsymbol{E} remains \boldsymbol{E}_{i,j} = \sigma_j^2 - \sigma_i^2, but every entry of \boldsymbol{E}_{[:, :r]} that actually needs to be inverted (those with i \neq j) is non-zero, so the inversion goes through. The result is: \begin{equation} d\boldsymbol{U}_{[:, :r]} = \boldsymbol{U}(\boldsymbol{F}_{[:, :r]}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}\boldsymbol{\Sigma}_{[:, :r]} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]})) \end{equation} Furthermore, using the following block partitions: \begin{equation} \boldsymbol{U} = \begin{pmatrix}\boldsymbol{U}_{[:,:r]} & \boldsymbol{U}_{[:,r:]}\end{pmatrix},\quad \boldsymbol{F}_{[:, :r]} = \begin{pmatrix} \boldsymbol{F}_{[:r, :r]} \\ \boldsymbol{F}_{[r:, :r]}\end{pmatrix} \end{equation} And since \boldsymbol{V}\boldsymbol{\Sigma}_{[:, :r]} = \boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]}, we can obtain: \begin{equation} \begin{aligned} d\boldsymbol{U}_{[:, :r]} =&\, \boldsymbol{U}_{[:, :r]}(\boldsymbol{F}_{[:r, :r]}\otimes(\boldsymbol{U}_{[:, :r]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]} + \boldsymbol{\Sigma}_{[:r, :r]}\boldsymbol{V}_{[:, :r]}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]})) \\[6pt] &\,\qquad + \boldsymbol{U}_{[:, r:]}(\boldsymbol{F}_{[r:, :r]}\otimes(\boldsymbol{U}_{[:, r:]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]})) \end{aligned} \label{eq:dU-r} \end{equation} According to the assumption, when i > r, \sigma_i=0, so every row of \boldsymbol{F}_{[r:, :r]} is (\sigma_1^{-2},\sigma_2^{-2},\cdots,\sigma_r^{-2}). Thus, taking the Hadamard product with \boldsymbol{F}_{[r:, :r]} is equivalent to right-multiplying by \boldsymbol{\Sigma}_{[:r, :r]}^{-2}.
Therefore: \begin{equation} \boldsymbol{F}_{[r:, :r]}\otimes(\boldsymbol{U}_{[:, r:]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]}) = \boldsymbol{U}_{[:, r:]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]}^{-1} \end{equation} Finally, using \boldsymbol{U}\boldsymbol{U}^{\top} = \boldsymbol{I} and \boldsymbol{U} = (\boldsymbol{U}_{[:,:r]},\boldsymbol{U}_{[:,r:]}), we get \boldsymbol{U}_{[:, r:]}\boldsymbol{U}_{[:, r:]}^{\top} = \boldsymbol{I} - \boldsymbol{U}_{[:, :r]}\boldsymbol{U}_{[:, :r]}^{\top}. Substituting this into equation [eq:dU-r] gives: \begin{equation} \begin{aligned} d\boldsymbol{U}_{[:, :r]} =&\, \boldsymbol{U}_{[:, :r]}(\boldsymbol{F}_{[:r, :r]}\otimes(\boldsymbol{U}_{[:, :r]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]} + \boldsymbol{\Sigma}_{[:r, :r]}\boldsymbol{V}_{[:, :r]}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]})) \\[6pt] &\,\qquad {\color{orange}+ (\boldsymbol{I} - \boldsymbol{U}_{[:, :r]}\boldsymbol{U}_{[:, :r]}^{\top})(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]}^{-1}} \end{aligned} \end{equation} This expresses d\boldsymbol{U}_{[:, :r]} as a function of \boldsymbol{U}_{[:, :r]}, \boldsymbol{\Sigma}_{[:r, :r]}, \boldsymbol{V}_{[:, :r]}. Under the assumption that non-zero singular values are distinct, these three are uniquely determined, so d\boldsymbol{U}_{[:, :r]} is uniquely determined. Compared to equation [eq:dU], it includes the extra term in orange. Similarly: \begin{equation} \begin{aligned} d\boldsymbol{V}_{[:, :r]} =&\, \boldsymbol{V}_{[:, :r]}(\boldsymbol{F}_{[:r, :r]}\otimes(\boldsymbol{V}_{[:, :r]}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]} + \boldsymbol{\Sigma}_{[:r, :r]}\boldsymbol{U}_{[:, :r]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]})) \\[6pt] &\,\qquad {\color{orange}+ (\boldsymbol{I} - \boldsymbol{V}_{[:, :r]}\boldsymbol{V}_{[:, :r]}^{\top})(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]}^{-1}} \end{aligned} \end{equation}
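
These general formulas can also be checked numerically. The NumPy sketch below takes a tall full-column-rank \boldsymbol{W} (so r = m and the thin SVD is used), and compares the predicted d\boldsymbol{U}_{[:, :r]}, including the orange projection term, against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, t = 7, 4, 1e-7          # tall matrix, rank r = m (almost surely)
W = rng.standard_normal((n, m))
dW = rng.standard_normal((n, m))

U, s, Vh = np.linalg.svd(W, full_matrices=False)   # U: n x m, s: length m, Vh: m x m
V, S, I = Vh.T, np.diag(s), np.eye(m)
F = (1 - I) / (s[None, :]**2 - s[:, None]**2 + I)

dU_pred = (U @ (F * (U.T @ dW @ V @ S + S @ V.T @ dW.T @ U))
           + (np.eye(n) - U @ U.T) @ dW @ V @ np.diag(1.0 / s))   # the extra projection term

U2, _, _ = np.linalg.svd(W + t * dW, full_matrices=False)
signs = np.sign(np.sum(U * U2, axis=0))            # align column signs to the base SVD
dU_fd = (U2 * signs - U) / t

print(np.max(np.abs(dU_pred - dU_fd)))   # should be small, roughly O(t)
```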

Summary

This article has provided a detailed derivation of the differentiation formulas for SVD.

When reposting, please include the original address: https://kexue.fm/archives/10878

For more details on reposting, please refer to: "Scientific Space FAQ"