Many readers, like the author, may find low-rank matrix approximation both familiar and strange. Familiar, because the concept and significance of low-rank approximation are not hard to grasp, and fine-tuning techniques built on it, such as LoRA, are now flourishing everywhere, so constant exposure has made the idea deeply ingrained. Strange, because the topic is very broad: papers on low-rank approximation often introduce unfamiliar but breathtaking new techniques, leaving one with a sense of only half understanding.
Therefore, in this series of articles, the author will attempt to systematically organize the theoretical content related to matrix low-rank approximation to complete our understanding. In the first article, we mainly introduce a relatively simple concept in the low-rank approximation series—the pseudo-inverse.
Optimization Perspective
The pseudo-inverse, also known as the "generalized inverse," is, as the name suggests, a generalized version of the inverse matrix: it extends the concept of an inverse to matrices that are not invertible (or not even square).
We know that for the matrix equation \boldsymbol{A}\boldsymbol{B}=\boldsymbol{M}, if \boldsymbol{A} is an invertible square matrix, we directly get \boldsymbol{B}=\boldsymbol{A}^{-1}\boldsymbol{M}. But what if \boldsymbol{A} is not invertible, or not even square? In that case, there may be no \boldsymbol{B} satisfying \boldsymbol{A}\boldsymbol{B}=\boldsymbol{M} at all. If we still want to proceed, we usually turn to an optimization problem: \begin{equation} \mathop{\mathrm{arg\,min}}_{\boldsymbol{B}} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2 \label{eq:loss-ab-m} \end{equation} where \boldsymbol{A}\in\mathbb{R}^{n\times r}, \boldsymbol{B}\in\mathbb{R}^{r\times m}, \boldsymbol{M}\in\mathbb{R}^{n\times m}. Note that the matrices are over \mathbb{R}: this series focuses on real matrices. \Vert\cdot\Vert_F is the Frobenius norm (F-norm) of a matrix, used here to measure the distance between \boldsymbol{A}\boldsymbol{B} - \boldsymbol{M} and the zero matrix. It is defined as: \begin{equation} \Vert \boldsymbol{M}\Vert_F = \sqrt{\sum_{i=1}^n\sum_{j=1}^m M_{i,j}^2} \end{equation} Simply put, instead of seeking an exact inverse matrix, we minimize the squared error between \boldsymbol{A}\boldsymbol{B} and \boldsymbol{M}. The theme of this series is low-rank approximation, so we assume r \ll \min(n,m). In machine-learning terms, the goal is to reconstruct the full matrix \boldsymbol{M} from a low-dimensional, lossy input matrix \boldsymbol{A} via a linear transformation \boldsymbol{B}.
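As a concrete illustration, here is a minimal NumPy sketch of this objective (the shapes, seed, and random data are assumptions made for the example, not taken from the text). Since the Frobenius-norm objective decomposes column by column into ordinary least-squares problems, np.linalg.lstsq solves it directly:

```python
import numpy as np

# Minimal sketch of argmin_B ||AB - M||_F^2.
# Shapes follow the text: A is n x r, B is r x m, M is n x m, with r << min(n, m).
rng = np.random.default_rng(0)
n, m, r = 50, 40, 5
A = rng.standard_normal((n, r))
M = rng.standard_normal((n, m))

# lstsq solves the least-squares problem column by column,
# which is exactly the Frobenius-norm objective above.
B, *_ = np.linalg.lstsq(A, M, rcond=None)
print(B.shape)                       # (r, m)
print(np.linalg.norm(A @ B - M))     # the minimal residual ||AB - M||_F
```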
When m=n and \boldsymbol{M} is the identity matrix \boldsymbol{I}_n, we obtain a result that depends only on \boldsymbol{A}, denoted as: \begin{equation} \boldsymbol{A}^{\dagger} = \mathop{\mathrm{arg\,min}}_{\boldsymbol{B}} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 \label{eq:loss-ab-m-b} \end{equation} Its function is similar to the inverse matrix of \boldsymbol{A}, so it is called the "(right) pseudo-inverse" of \boldsymbol{A}. Similarly, if the \boldsymbol{B} matrix is given, we can also change the optimization parameter to \boldsymbol{A}, obtaining the "(left) pseudo-inverse" of \boldsymbol{B}: \begin{equation} \boldsymbol{B}^{\dagger} = \mathop{\mathrm{arg\,min}}_{\boldsymbol{A}} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 \end{equation}
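Under the same assumed setup, a quick numerical check confirms that the minimizer with \boldsymbol{M}=\boldsymbol{I}_n coincides with NumPy's built-in pseudo-inverse np.linalg.pinv (here \boldsymbol{A} has full column rank, so the minimizer is unique):

```python
import numpy as np

# Check that argmin_B ||AB - I_n||_F^2 equals np.linalg.pinv(A)
# for a random full-column-rank A (assumed example data).
rng = np.random.default_rng(1)
n, r = 50, 5
A = rng.standard_normal((n, r))

B, *_ = np.linalg.lstsq(A, np.eye(n), rcond=None)  # argmin_B ||AB - I_n||_F^2
print(np.allclose(B, np.linalg.pinv(A)))           # True
```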
Matrix Differentiation
Returning to the main topic, for an optimization objective the most ideal outcome is an analytical solution obtained by differentiation, and equation [eq:loss-ab-m] allows exactly that! This can be seen almost by inspection: \boldsymbol{A}\boldsymbol{B}-\boldsymbol{M} is a linear function of \boldsymbol{B}, so \Vert \boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}\Vert_F^2 is a quadratic function of \boldsymbol{B} (and, symmetrically, of \boldsymbol{A}), and the minimum of such a quadratic has a closed-form solution.
To find the derivative of \mathcal{L}=\Vert \boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}\Vert_F^2 with respect to \boldsymbol{B}, we first differentiate \mathcal{L} with respect to \boldsymbol{E}=\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}, then \boldsymbol{E} with respect to \boldsymbol{B}, and combine them via the chain rule: \begin{equation} \frac{\partial \mathcal{L}}{\partial B_{i,j}} = \sum_{k,l}\frac{\partial \mathcal{L}}{\partial E_{k,l}}\frac{\partial E_{k,l}}{\partial B_{i,j}} \end{equation} According to the definition \mathcal{L}=\Vert \boldsymbol{E}\Vert_F^2 = \sum_{i,j} E_{i,j}^2, among the many squared terms in the sum, only the one with (i,j)=(k,l) has a non-zero derivative with respect to E_{k,l}. So the derivative of \mathcal{L} with respect to E_{k,l} equals the derivative of E_{k,l}^2 with respect to E_{k,l}, i.e., \frac{\partial \mathcal{L}}{\partial E_{k,l}}=2E_{k,l}. Next, by the definition of matrix multiplication: \begin{equation} E_{k,l} = \left(\sum_{\alpha} A_{k,\alpha}B_{\alpha,l}\right) - M_{k,l} \end{equation} Similarly, only when (\alpha,l)=(i,j) does differentiating the above with respect to B_{i,j} give a non-zero result, namely A_{k,i}. So we can write \frac{\partial E_{k,l}}{\partial B_{i,j}}=A_{k,i}\delta_{l,j}, where \delta_{l,j} is the Kronecker delta, encoding the condition l=j. Combining the results, we get: \begin{equation} \frac{\partial \mathcal{L}}{\partial B_{i,j}} = 2\sum_{k,l}E_{k,l}A_{k,i}\delta_{l,j} = 2\sum_k E_{k,j}A_{k,i} \end{equation} If we agree that the gradient of a scalar with respect to a matrix has the same shape as that matrix, then we can write: \begin{equation} \frac{\partial \mathcal{L}}{\partial \boldsymbol{B}} = 2\boldsymbol{A}^{\top}(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}) \end{equation} Although the derivation is somewhat tedious, the result is quite intuitive: \frac{\partial \mathcal{L}}{\partial \boldsymbol{B}} is essentially the product of 2(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}) and \boldsymbol{A} (by analogy with scalar differentiation). Since we agreed that \frac{\partial \mathcal{L}}{\partial \boldsymbol{B}} has the shape of \boldsymbol{B} (i.e., r\times m), we must multiply 2(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M})\in\mathbb{R}^{n\times m} and \boldsymbol{A}\in\mathbb{R}^{n\times r} in the only way that yields an r\times m result, which leads to the expression above. By the same principle, we can immediately write: \begin{equation} \frac{\partial \mathcal{L}}{\partial \boldsymbol{A}} = 2(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M})\boldsymbol{B}^{\top} \end{equation}
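The formula is easy to sanity-check numerically. Below is a small finite-difference test of \frac{\partial \mathcal{L}}{\partial \boldsymbol{B}} = 2\boldsymbol{A}^{\top}(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M}); the sizes and random matrices are illustrative assumptions:

```python
import numpy as np

# Finite-difference check of dL/dB = 2 A^T (AB - M) on random data (assumed example).
rng = np.random.default_rng(2)
n, m, r = 7, 6, 3
A = rng.standard_normal((n, r))
B = rng.standard_normal((r, m))
M = rng.standard_normal((n, m))

def loss(B):
    return np.sum((A @ B - M) ** 2)

grad_analytic = 2 * A.T @ (A @ B - M)

# Central differences, one entry of B at a time.
eps = 1e-6
grad_numeric = np.zeros_like(B)
for i in range(r):
    for j in range(m):
        E = np.zeros_like(B); E[i, j] = eps
        grad_numeric[i, j] = (loss(B + E) - loss(B - E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```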
Basic Results
Now that we have found \frac{\partial \mathcal{L}}{\partial \boldsymbol{B}} and \frac{\partial \mathcal{L}}{\partial \boldsymbol{A}}, setting them to zero allows us to solve for the corresponding optimal solutions: \begin{align} 2\boldsymbol{A}^{\top}(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M})=0 \quad &\Rightarrow \quad \boldsymbol{B} = (\boldsymbol{A}^{\top} \boldsymbol{A})^{-1}\boldsymbol{A}^{\top}\boldsymbol{M} = \mathop{\mathrm{arg\,min}}_{\boldsymbol{B}} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2 \\ 2(\boldsymbol{A}\boldsymbol{B}-\boldsymbol{M})\boldsymbol{B}^{\top}=0 \quad &\Rightarrow \quad \boldsymbol{A} = \boldsymbol{M}\boldsymbol{B}^{\top}(\boldsymbol{B} \boldsymbol{B}^{\top})^{-1} = \mathop{\mathrm{arg\,min}}_{\boldsymbol{A}} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2 \end{align} Substituting \boldsymbol{M}=\boldsymbol{I}_n, we get: \begin{align} \boldsymbol{A}^{\dagger} &= (\boldsymbol{A}^{\top} \boldsymbol{A})^{-1}\boldsymbol{A}^{\top} \label{eq:p-inv-a}\\ \boldsymbol{B}^{\dagger} &= \boldsymbol{B}^{\top}(\boldsymbol{B} \boldsymbol{B}^{\top})^{-1} \label{eq:p-inv-b} \end{align} If \boldsymbol{A} or \boldsymbol{B} is an invertible square matrix, it is easy to prove that the pseudo-inverse equals the inverse matrix, i.e., \boldsymbol{A}^{\dagger}=\boldsymbol{A}^{-1}, \boldsymbol{B}^{\dagger}=\boldsymbol{B}^{-1}. Furthermore, based on the above equations, we can verify:
1. (\boldsymbol{A}^{\dagger})^{\dagger}=\boldsymbol{A}, (\boldsymbol{B}^{\dagger})^{\dagger}=\boldsymbol{B}, meaning the pseudo-inverse of a pseudo-inverse is the original matrix. This implies that while the pseudo-inverse acts as an approximate inverse, it preserves its own information.
2. \boldsymbol{A}\boldsymbol{A}^{\dagger}\boldsymbol{A}=\boldsymbol{A}, \boldsymbol{B}^{\dagger}\boldsymbol{B}\boldsymbol{B}^{\dagger}=\boldsymbol{B}^{\dagger}, meaning that although \boldsymbol{A}\boldsymbol{A}^{\dagger} and \boldsymbol{B}^{\dagger}\boldsymbol{B} are generally not the identity matrix \boldsymbol{I}, they act as the identity when applied to \boldsymbol{A} and \boldsymbol{B}^{\dagger}, respectively. (A short numerical check of these formulas and properties follows below.)
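Here is the promised check: the closed form for \boldsymbol{A}^{\dagger} and the two properties above, verified on a random \boldsymbol{A} with full column rank (so that \boldsymbol{A}^{\top}\boldsymbol{A} is invertible); the sizes and data are assumptions for the example:

```python
import numpy as np

# Numerical check of A^+ = (A^T A)^{-1} A^T and the properties listed above,
# for a random full-column-rank A (assumed example data).
rng = np.random.default_rng(3)
n, r = 20, 4
A = rng.standard_normal((n, r))

A_pinv = np.linalg.inv(A.T @ A) @ A.T            # (A^T A)^{-1} A^T
print(np.allclose(A_pinv, np.linalg.pinv(A)))    # matches NumPy's pseudo-inverse

print(np.allclose(np.linalg.pinv(A_pinv), A))    # (A^+)^+ = A
print(np.allclose(A @ A_pinv @ A, A))            # A A^+ A = A
print(np.allclose(A_pinv @ A, np.eye(r)))        # here A^+ A = I_r, while A A^+ is only a projection
```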
By the way, the pseudo-inverse of a matrix is actually a very broad concept with many different forms. What we have introduced here is the most common "Moore–Penrose inverse." Others include the "Drazin inverse," "Bott–Duffin inverse," etc., but the author is not familiar with these, so they will not be expanded upon. Readers can refer to the "Generalized inverse" entry on Wikipedia.
General Form
However, we are not finished. A key prerequisite for equations [eq:p-inv-a] and [eq:p-inv-b] to hold is that the corresponding \boldsymbol{A}^{\top} \boldsymbol{A} or \boldsymbol{B} \boldsymbol{B}^{\top} must be invertible. What if they are not?
Taking \boldsymbol{A}^{\dagger} as an example, if \boldsymbol{A}^{\top} \boldsymbol{A} is not invertible, the rank of \boldsymbol{A} is some s < r. We can pick a maximal linearly independent set of s column vectors to form a matrix \boldsymbol{A}_s \in \mathbb{R}^{n \times s}, and then write \boldsymbol{A} = \boldsymbol{A}_s \boldsymbol{P}, where \boldsymbol{P} \in \mathbb{R}^{s \times r} expresses each column of \boldsymbol{A} as a linear combination of the columns of \boldsymbol{A}_s. In this case: \begin{equation} \mathop{\mathrm{arg\,min}}_{\boldsymbol{B}} \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 = \mathop{\mathrm{arg\,min}}_{\boldsymbol{B}} \Vert \boldsymbol{A}_s \boldsymbol{P}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 \end{equation} If we still denote an optimal \boldsymbol{B} by \boldsymbol{A}^{\dagger}, then since \boldsymbol{P} has full row rank s, the map \boldsymbol{B}\mapsto\boldsymbol{P}\boldsymbol{B} is surjective, so minimizing over \boldsymbol{B} is equivalent to minimizing over \boldsymbol{P}\boldsymbol{B}; all we can determine is: \begin{equation} \boldsymbol{P}\boldsymbol{A}^{\dagger} = \boldsymbol{A}_s^{\dagger} = (\boldsymbol{A}_s^{\top} \boldsymbol{A}_s)^{-1}\boldsymbol{A}_s^{\top} \end{equation} Since \boldsymbol{A}_s consists of a maximal linearly independent set of columns, \boldsymbol{A}_s^{\top} \boldsymbol{A}_s is necessarily invertible, so this expression is well-defined. However, the map \boldsymbol{A}^{\dagger}\mapsto\boldsymbol{P}\boldsymbol{A}^{\dagger} reduces dimensionality, so there are multiple \boldsymbol{A}^{\dagger} with \boldsymbol{P}\boldsymbol{A}^{\dagger} = \boldsymbol{A}_s^{\dagger}. That is, when \boldsymbol{A}^{\top} \boldsymbol{A} is non-invertible, the optimal solution to [eq:loss-ab-m-b] is not unique.
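The non-uniqueness is easy to see numerically. In the assumed example below, \boldsymbol{A} = \boldsymbol{A}_s\boldsymbol{P} has rank s < r, and perturbing a minimizer along the null space of \boldsymbol{A} leaves the loss unchanged:

```python
import numpy as np

# Illustration (assumed example) of non-uniqueness when rank(A) = s < r:
# adding any rank-one perturbation v w^T with A v = 0 does not change ||AB - I_n||_F.
rng = np.random.default_rng(4)
n, r, s = 10, 4, 2
A_s = rng.standard_normal((n, s))
P = rng.standard_normal((s, r))
A = A_s @ P                                    # A = A_s P has rank s < r

B1 = np.linalg.pinv(A)                         # one minimizer of ||AB - I_n||_F^2

# Take v with A v = 0 (right-singular vector for a zero singular value of A).
_, _, Vt = np.linalg.svd(A)
v = Vt[-1]                                     # null-space direction, shape (r,)
B2 = B1 + np.outer(v, rng.standard_normal(n))  # a different B with the same loss

I = np.eye(n)
print(np.allclose(np.linalg.norm(A @ B1 - I), np.linalg.norm(A @ B2 - I)))  # True
print(np.allclose(B1, B2))                                                  # False
```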
One possible fix is to supplement the definition with (\boldsymbol{A}^{\dagger})^{\dagger}=\boldsymbol{A} or \boldsymbol{A}^{\dagger}\boldsymbol{A}\boldsymbol{A}^{\dagger}=\boldsymbol{A}^{\dagger} so that \boldsymbol{A}^{\dagger} is uniquely determined, but this feels like patching. Instead, a clever trick handles the problem elegantly and uniformly: since the issue is non-uniqueness of the solution, we add a regularization term to make it unique, and then let the weight of the regularization term tend to zero: \begin{equation} \boldsymbol{A}^{\dagger} = \lim_{\epsilon\to 0^+} \mathop{\mathrm{arg\,min}}_{\boldsymbol{B}} \left( \Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 + \epsilon\Vert \boldsymbol{B}\Vert_F^2 \right) \end{equation} From this we can solve: \begin{equation} \boldsymbol{A}^{\dagger} = \lim_{\epsilon\to 0^+} (\boldsymbol{A}^{\top} \boldsymbol{A} + \epsilon \boldsymbol{I}_r)^{-1}\boldsymbol{A}^{\top} \label{eq:a-pinv-lim} \end{equation} When \epsilon > 0, \boldsymbol{A}^{\top} \boldsymbol{A} + \epsilon \boldsymbol{I}_r is necessarily invertible (readers are encouraged to prove this), so the expression is well-defined, and since the regularization term becomes negligible as \epsilon \to 0^+, the limit exists. Note that this refers to the existence of the overall limit: when \boldsymbol{A}^{\top} \boldsymbol{A} is non-invertible, \lim_{\epsilon\to 0^+} (\boldsymbol{A}^{\top} \boldsymbol{A} + \epsilon \boldsymbol{I}_r)^{-1} by itself does not exist (it blows up), but the product with \boldsymbol{A}^{\top} remains finite.
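A small sketch of this limiting definition (with assumed shapes and random data): for a rank-deficient \boldsymbol{A}, the ridge-regularized solution approaches np.linalg.pinv(A) as \epsilon \to 0^+:

```python
import numpy as np

# Sketch of the regularized limit: (A^T A + eps I)^{-1} A^T -> pinv(A) as eps -> 0+,
# even when A^T A is singular (assumed example data).
rng = np.random.default_rng(5)
n, r, s = 10, 4, 2
A = rng.standard_normal((n, s)) @ rng.standard_normal((s, r))   # rank s < r

for eps in [1e-2, 1e-4, 1e-6, 1e-8]:
    B_eps = np.linalg.inv(A.T @ A + eps * np.eye(r)) @ A.T
    print(eps, np.linalg.norm(B_eps - np.linalg.pinv(A)))        # error shrinks as eps -> 0
```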
What are the advantages of equation [eq:a-pinv-lim] as a general definition? First, it is consistent with the invertible case [eq:p-inv-a], which is theoretically elegant. Second, this formal consistency means that properties such as (\boldsymbol{A}^{\dagger})^{\dagger}=\boldsymbol{A} are preserved, so we can discuss \boldsymbol{A}^{\dagger} almost without worrying about whether \boldsymbol{A}^{\top} \boldsymbol{A} is invertible.
Numerical Computation
Of course, equation [eq:a-pinv-lim] is so far only a formal definition. Using it directly for numerical computation would mean picking a sufficiently small \epsilon and computing (\boldsymbol{A}^{\top} \boldsymbol{A} + \epsilon \boldsymbol{I}_r)^{-1}, which easily leads to severe numerical instability. To obtain a stable calculation method, we use the fact that a real symmetric matrix can always be orthogonally diagonalized (the Spectral Theorem) to decompose \boldsymbol{A}^{\top} \boldsymbol{A}: \begin{equation} \boldsymbol{A}^{\top} \boldsymbol{A} = \boldsymbol{U}\boldsymbol{\Lambda} \boldsymbol{U}^{\top} \end{equation} where \boldsymbol{U} is an orthogonal matrix and \boldsymbol{\Lambda}=\mathop{\mathrm{diag}}(\lambda_1,\lambda_2,\cdots,\lambda_r) is the diagonal matrix of eigenvalues. By the positive semi-definiteness of \boldsymbol{A}^{\top} \boldsymbol{A}, its eigenvalues are non-negative. Using this decomposition: \begin{equation} \begin{aligned} (\boldsymbol{A}^{\top} \boldsymbol{A} + \epsilon \boldsymbol{I}_r)^{-1}\boldsymbol{A}^{\top} &= (\boldsymbol{U}\boldsymbol{\Lambda} \boldsymbol{U}^{\top} + \epsilon \boldsymbol{I}_r)^{-1} \boldsymbol{A}^{\top} \\ &= [\boldsymbol{U}(\boldsymbol{\Lambda} + \epsilon \boldsymbol{I}_r) \boldsymbol{U}^{\top}]^{-1}\boldsymbol{A}^{\top} \\ &= \boldsymbol{U}(\boldsymbol{\Lambda} + \epsilon \boldsymbol{I}_r)^{-1} \boldsymbol{U}^{\top}\boldsymbol{A}^{\top} \end{aligned} \end{equation} For (\boldsymbol{\Lambda} + \epsilon \boldsymbol{I}_r)^{-1}, we have: \begin{equation} (\boldsymbol{\Lambda} + \epsilon \boldsymbol{I}_r)^{-1} = \mathop{\mathrm{diag}}\Big((\lambda_1 + \epsilon)^{-1},(\lambda_2 + \epsilon)^{-1},\cdots,(\lambda_r + \epsilon)^{-1}\Big) \end{equation} If \lambda_i > 0, then \lim_{\epsilon\to 0^+} (\lambda_i + \epsilon)^{-1} = \lambda_i^{-1}, which is finite. The problem arises when \lambda_i = 0, where (\lambda_i + \epsilon)^{-1} \to \infty. However, we know the limit [eq:a-pinv-lim] is finite, and the only way to cancel such an infinity is multiplication by zero, i.e., \lim_{\epsilon\to 0^+} \epsilon^{-1} \times 0 = 0. Therefore, when \lambda_i = 0, the corresponding row of \boldsymbol{U}^{\top}\boldsymbol{A}^{\top} must be zero. Indeed, this can be seen directly: \lambda_i = 0 means \boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{u}_i = \boldsymbol{0}, hence \Vert\boldsymbol{A}\boldsymbol{u}_i\Vert^2 = \boldsymbol{u}_i^{\top}\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{u}_i = 0, so \boldsymbol{A}\boldsymbol{u}_i = \boldsymbol{0}, and the i-th row of \boldsymbol{U}^{\top}\boldsymbol{A}^{\top}, which is (\boldsymbol{A}\boldsymbol{u}_i)^{\top}, vanishes.
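This claim is also easy to verify numerically: in the assumed example below, the rows of \boldsymbol{U}^{\top}\boldsymbol{A}^{\top} corresponding to (numerically) zero eigenvalues are themselves numerically zero:

```python
import numpy as np

# Check that rows of U^T A^T belonging to zero eigenvalues of A^T A are zero
# (random rank-deficient A, assumed example data).
rng = np.random.default_rng(7)
A = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 4))   # rank 2, so two zero eigenvalues
lam, U = np.linalg.eigh(A.T @ A)                                 # eigenvalues in ascending order
print(lam)                                                       # first two are numerically zero
print(np.linalg.norm((U.T @ A.T)[:2], axis=1))                   # corresponding rows are ~0
```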
In other words, when \lambda_i = 0, the factor that (\lambda_i + \epsilon)^{-1} multiplies in \boldsymbol{U}^{\top}\boldsymbol{A}^{\top} is itself 0. Since zero times anything is zero, the value assigned to (\lambda_i + \epsilon)^{-1} when \lambda_i = 0 does not actually matter; we can simply set it to 0. Thus we obtain a general method for computing \boldsymbol{A}^{\dagger}: \begin{equation} \boldsymbol{A}^{\dagger} = \boldsymbol{U}\boldsymbol{\Lambda}^{\dagger}\boldsymbol{U}^{\top}\boldsymbol{A}^{\top}, \quad \boldsymbol{A}^{\top} \boldsymbol{A} = \boldsymbol{U}\boldsymbol{\Lambda} \boldsymbol{U}^{\top} \end{equation} where \boldsymbol{\Lambda}^{\dagger} is the diagonal matrix obtained from \boldsymbol{\Lambda} by leaving zero diagonal entries as zero and replacing each non-zero entry with its reciprocal.
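Here is a minimal implementation of this recipe. It thresholds numerically-zero eigenvalues; the tolerance, function name, and test data are choices made for the sketch, not prescribed by the text:

```python
import numpy as np

def pinv_via_eigh(A, tol=1e-12):
    """Compute A^+ = U Lambda^+ U^T A^T from the eigendecomposition of A^T A."""
    lam, U = np.linalg.eigh(A.T @ A)          # A^T A = U diag(lam) U^T, lam >= 0 (ascending)
    # Invert only the numerically non-zero eigenvalues; zero eigenvalues stay zero.
    inv = np.array([1.0 / l if l > tol * lam.max() else 0.0 for l in lam])
    return U @ np.diag(inv) @ U.T @ A.T

rng = np.random.default_rng(6)
A = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 4))   # rank-deficient A
print(np.allclose(pinv_via_eigh(A), np.linalg.pinv(A)))          # True
```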
One might ask: if zero times anything is zero, why must the zero \lambda_i be mapped to zero rather than to some other value? Indeed, any value would leave the result unchanged, but to maintain consistency with equation [eq:a-pinv-lim], we define: \begin{equation} \boldsymbol{\Lambda}^{\dagger} = \lim_{\epsilon\to 0^+} (\boldsymbol{\Lambda}^{\top} \boldsymbol{\Lambda} + \epsilon \boldsymbol{I}_r)^{-1}\boldsymbol{\Lambda}^{\top} = \mathop{\mathrm{diag}}(\kappa_1,\kappa_2,\cdots,\kappa_r), \quad \kappa_i = \begin{cases} \lambda_i^{-1}, & \lambda_i \neq 0 \\ 0, & \lambda_i = 0 \end{cases} \end{equation}
Summary
In this article, we introduced the pseudo-inverse from the perspective of low-rank approximation. This is an extension of the concept of the inverse matrix to non-square or non-invertible matrices, allowing us to more effectively analyze and solve general matrix equations.