Let us return once again to the road of low-rank approximation. In The Road to Low-Rank Approximation (IV): ID, we introduced the "Interpolative Decomposition (ID)", which finds an approximation of the form \boldsymbol{C}\boldsymbol{Z} for a matrix \boldsymbol{M}\in\mathbb{R}^{n\times m}, where \boldsymbol{C}\in\mathbb{R}^{n\times r} consists of several columns of \boldsymbol{M} and \boldsymbol{Z}\in\mathbb{R}^{r\times m} is an arbitrary matrix.
In this article, we will introduce the CUR decomposition. It follows the same philosophy as Interpolative Decomposition, using the rows and columns of the original matrix as a "skeleton" to construct an approximation of the original matrix. Unlike ID, which uses either rows or columns, CUR decomposition utilizes both rows and columns simultaneously.
Basic Definition
Actually, this is not the first time CUR decomposition has appeared on this site. As early as Nyströmformer: A Linearized Attention Scheme Based on Matrix Factorization, we introduced the Nyström approximation of a matrix, which is essentially a CUR decomposition. Later, in Using CUR Decomposition to Accelerate Retrieval in Interactive Similarity Models, we also introduced the application of CUR decomposition in reducing the retrieval complexity of interactive similarity models.
The key to these applications of CUR decomposition lies in the "C" and "R" in its name. Specifically, CUR decomposition seeks to find an approximation for a matrix \boldsymbol{M}\in\mathbb{R}^{n\times m} in the following form: \begin{equation} \mathop{\text{argmin}}_{S_1,S_2,\boldsymbol{\mathcal{U}}}\Vert \underbrace{\boldsymbol{M}_{[:,S_1]}}_{\boldsymbol{\mathcal{C}}}\boldsymbol{\mathcal{U}}\underbrace{\boldsymbol{M}_{[S_2,:]}}_{\boldsymbol{\mathcal{R}}} - \boldsymbol{M}\Vert_F^2\quad\text{s.t.}\quad \left\{\begin{aligned}&S_1\subset\{0,1,\cdots,m-1\},|S_1|=r\\ &S_2\subset\{0,1,\cdots,n-1\},|S_2|=r \\ &\boldsymbol{\mathcal{U}}\in\mathbb{R}^{r\times r} \end{aligned}\right. \end{equation} To distinguish it from the \boldsymbol{U} in SVD, we use calligraphic letters \boldsymbol{\mathcal{C}}, \boldsymbol{\mathcal{U}}, \boldsymbol{\mathcal{R}}. For comparison, the ID introduced in the previous article is: \begin{equation} \mathop{\text{argmin}}_{S,\boldsymbol{Z}}\Vert \underbrace{\boldsymbol{M}_{[:,S]}}_{\boldsymbol{C}}\boldsymbol{Z} - \boldsymbol{M}\Vert_F^2\quad\text{s.t.}\quad \left\{\begin{aligned} &S\subset\{0,1,\cdots,m-1\},|S|=r\\ &\boldsymbol{Z}\in\mathbb{R}^{r\times m} \end{aligned}\right. \end{equation} And the low-rank approximation found by SVD is: \begin{equation} \mathop{\text{argmin}}_{\boldsymbol{U},\boldsymbol{\Sigma},\boldsymbol{V}}\Vert \boldsymbol{U}_{[:,:r]}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{V}_{[:,:r]}^{\top} - \boldsymbol{M}\Vert_F^2\quad\text{s.t.}\quad \left\{\begin{aligned} &\boldsymbol{U}\in\mathbb{R}^{n\times n}, \boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I}_n \\ &\boldsymbol{V}\in\mathbb{R}^{m\times m}, \boldsymbol{V}^{\top}\boldsymbol{V} = \boldsymbol{I}_m \\ &\boldsymbol{\Sigma}=\text{diag}(\sigma_1,\cdots,\sigma_{\min(n,m)})\in\mathbb{R}_{\geq 0}^{n\times m} \end{aligned}\right. \end{equation} In the SVD article, we proved that SVD can find the optimal solution for a rank-r approximation, but its computational complexity is high, and the physical meaning of \boldsymbol{U} and \boldsymbol{V} is not intuitive. In contrast, CUR decomposition replaces \boldsymbol{U} and \boldsymbol{V} with columns \boldsymbol{\mathcal{C}} and rows \boldsymbol{\mathcal{R}} taken from the original matrix. Although it is inferior to SVD in approximation accuracy, it is superior in interpretability, storage cost, and computational cost.
From an appearance perspective, the left and right matrices \boldsymbol{U}, \boldsymbol{V} of the SVD approximation are more complex while the middle matrix \boldsymbol{\Sigma} is simpler. CUR decomposition is the opposite: the left and right matrices \boldsymbol{\mathcal{C}}, \boldsymbol{\mathcal{R}} are simpler while the middle matrix \boldsymbol{\mathcal{U}} is more complex.
Choice of U
Obviously, the difficulty of CUR decomposition lies in the selection of rows and columns, because once \boldsymbol{\mathcal{C}} and \boldsymbol{\mathcal{R}} are given, the optimal solution for \boldsymbol{\mathcal{U}} can be analytically expressed using the pseudo-inverse: \begin{equation} \boldsymbol{\mathcal{U}}^* = \boldsymbol{\mathcal{C}}^{\dagger}\boldsymbol{M}\boldsymbol{\mathcal{R}}^{\dagger} \end{equation} The derivation process can refer to the post on pseudo-inverses. In fact, this solution is also quite intuitive. Assuming \boldsymbol{\mathcal{C}} and \boldsymbol{\mathcal{R}} are both invertible matrices, the solution to the equation \boldsymbol{\mathcal{C}}\boldsymbol{\mathcal{U}}\boldsymbol{\mathcal{R}}=\boldsymbol{M} is naturally \boldsymbol{\mathcal{U}}=\boldsymbol{\mathcal{C}}^{-1}\boldsymbol{M}\boldsymbol{\mathcal{R}}^{-1}. When they are not invertible, the inverse {}^{-1} is replaced by the pseudo-inverse {}^{\dagger}.
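For concreteness, here is a minimal numpy sketch of this construction; the random test matrix and the index sets S_1, S_2 below are placeholders standing in for whatever row/column selection scheme is actually used:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 100, 80, 10
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # an exactly rank-r test matrix

# placeholder index sets; any of the selection schemes discussed below could produce them
S1 = rng.choice(m, size=r, replace=False)   # column indices
S2 = rng.choice(n, size=r, replace=False)   # row indices

C = M[:, S1]                                   # n x r, selected columns
R = M[S2, :]                                   # r x m, selected rows
U = np.linalg.pinv(C) @ M @ np.linalg.pinv(R)  # optimal middle matrix for the given C, R

# for an exactly rank-r matrix whose skeleton captures the full column/row space,
# the reconstruction is exact up to rounding
print(np.linalg.norm(C @ U @ R - M) / np.linalg.norm(M))
```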
In addition to this theoretical optimal solution, there is another frequently used and, in some sense, more intuitive choice for CUR decomposition: \begin{equation} \boldsymbol{\mathcal{U}} = \boldsymbol{M}_{[S_2,S_1]}^{\dagger} \end{equation} Note that the slicing operation has higher priority than transposition and pseudo-inverse, so this \boldsymbol{\mathcal{U}} is actually the pseudo-inverse of the submatrix formed by the intersection of \boldsymbol{\mathcal{C}} and \boldsymbol{\mathcal{R}}.
How should we understand this choice? By permuting rows and columns, we can move the selected rows and columns to the front. Assuming \boldsymbol{M}_{[S_2,S_1]} is invertible, the result can be written in block form as: \begin{equation} \underbrace{\begin{pmatrix}\boldsymbol{A} & \boldsymbol{B} \\ \boldsymbol{C} & \boldsymbol{D}\end{pmatrix}}_{\boldsymbol{M}} \approx \underbrace{\begin{pmatrix}\boldsymbol{A} \\ \boldsymbol{C}\end{pmatrix}}_{\boldsymbol{\mathcal{C}}}\,\,\underbrace{\boldsymbol{A}^{-1}}_{\boldsymbol{\mathcal{U}}}\,\,\underbrace{\begin{pmatrix}\boldsymbol{A} & \boldsymbol{B}\end{pmatrix}}_{\boldsymbol{\mathcal{R}}} = \begin{pmatrix}\boldsymbol{A} & \boldsymbol{B} \\ \boldsymbol{C} & \boldsymbol{C}\boldsymbol{A}^{-1}\boldsymbol{B}\end{pmatrix}\label{eq:id-abcd} \end{equation} As can be seen, this CUR decomposition exactly reconstructs the selected \boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C} (that is, \boldsymbol{\mathcal{C}} and \boldsymbol{\mathcal{R}}), and uses \boldsymbol{C}\boldsymbol{A}^{-1}\boldsymbol{B} to approximate \boldsymbol{D}. In this sense, CUR decomposition is equivalent to a "Matrix Completion" method.
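Continuing the sketch above (same placeholder S_1, S_2, and assuming the r\times r intersection submatrix is invertible, as in the discussion):

```python
# alternative middle matrix: pseudo-inverse of the intersection of the selected rows and columns
A = M[np.ix_(S2, S1)]          # r x r block playing the role of A in the identity above
U2 = np.linalg.pinv(A)

# this choice reproduces the selected columns and rows exactly (cf. the block-matrix form)
approx = C @ U2 @ R
print(np.allclose(approx[:, S1], M[:, S1]), np.allclose(approx[S2, :], M[S2, :]))
```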
It is worth pointing out that since both choices of \boldsymbol{\mathcal{U}} use the pseudo-inverse, and the definition of the pseudo-inverse does not require square matrices, the most general CUR decomposition does not actually require \boldsymbol{\mathcal{C}} and \boldsymbol{\mathcal{R}} to have the same number of columns and rows. If necessary, we can choose different numbers of columns for \boldsymbol{\mathcal{C}} and rows for \boldsymbol{\mathcal{R}}.
Row and Column Selection
After solving for \boldsymbol{\mathcal{U}}, the main task is the selection of \boldsymbol{\mathcal{C}} and \boldsymbol{\mathcal{R}}. Since the selection of rows and columns is essentially equivalent, we will take column selection as an example.
That is to say, our task is to select r key columns from the matrix \boldsymbol{M} to serve as its "skeleton," also called "profile" or "sketch." We have already explored this problem in the previous two articles (namely the CR article and the ID article). The schemes therein can also be used to construct \boldsymbol{\mathcal{C}} and \boldsymbol{\mathcal{R}} for CUR decomposition, including:
1. Selecting the r columns with the largest norms;
2. Randomly sampling r columns with norms as weights;
3. Uniformly random sampling of r columns;
4. Selecting the first r columns using column-pivoted QR decomposition.
These schemes have their own advantages and disadvantages, applicable scenarios, and implicit assumptions. In addition, we can consider more intuitive approaches. For example, since key columns play a role similar to "cluster centers," we can cluster the m column vectors into r classes and then select the r columns closest to the cluster centers. When m is truly too large, we can first randomly sample a subset of columns and then run the aforementioned selection algorithms on it.
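As a reference, here is a small sketch of schemes 1-4 from the list above (the function name and its method keywords are our own choices; column-pivoted QR comes from scipy):

```python
import numpy as np
from scipy.linalg import qr

def select_columns(M, r, method="norm", seed=None):
    """Return r column indices of M under a few simple selection schemes."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(M, axis=0)
    if method == "norm":            # scheme 1: the r columns with the largest norms
        return np.argsort(-norms)[:r]
    if method == "norm_sampling":   # scheme 2: random sampling with norms as weights
        return rng.choice(M.shape[1], size=r, replace=False, p=norms / norms.sum())
    if method == "uniform":         # scheme 3: uniform random sampling
        return rng.choice(M.shape[1], size=r, replace=False)
    if method == "qr":              # scheme 4: first r pivots of column-pivoted QR
        _, _, piv = qr(M, pivoting=True)
        return piv[:r]
    raise ValueError(f"unknown method: {method}")
```

Row selection works the same way by applying the function to \boldsymbol{M}^{\top}.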
In general, column selection is a classic problem in matrix approximation. English keywords include Randomized Linear Algebra, Column Subset Selection, etc. You can find a lot of information by searching for these.
Leverage Scores
Of course, as a new article, it is best to introduce some new methods. Next, we will introduce two other ideas for column selection. The first is called "Leverage Scores," which performs column selection through the idea of linear regression.
First, we treat the matrix \boldsymbol{M} as m samples of n dimensions. Correspondingly, there are m vectors of d dimensions forming a target matrix \boldsymbol{Y}. Our task is to use \boldsymbol{M} to predict \boldsymbol{Y} using the simplest linear model, with the optimization objective being least squares: \begin{equation} \boldsymbol{W}^* = \mathop{\text{argmin}}_{\boldsymbol{W}} \Vert\boldsymbol{Y} - \boldsymbol{W}\boldsymbol{M}\Vert_F^2\label{eq:linear-loss} \end{equation} We already solved this objective in the pseudo-inverse article; the answer is \boldsymbol{W}^* = \boldsymbol{Y}\boldsymbol{M}^{\dagger}. Assuming n < m and the rank of \boldsymbol{M} is n, it can be further written as \boldsymbol{W}^* = \boldsymbol{Y}\boldsymbol{M}^{\top}(\boldsymbol{M}\boldsymbol{M}^{\top})^{-1}, so we have: \begin{equation} \hat{\boldsymbol{Y}} = \boldsymbol{W}^*\boldsymbol{M} = \boldsymbol{Y}\boldsymbol{M}^{\top}(\boldsymbol{M}\boldsymbol{M}^{\top})^{-1}\boldsymbol{M} = \boldsymbol{Y}\boldsymbol{H} \end{equation} Here \boldsymbol{H}=\boldsymbol{M}^{\top}(\boldsymbol{M}\boldsymbol{M}^{\top})^{-1}\boldsymbol{M} is called the "Hat Matrix," supposedly because it transforms \boldsymbol{Y} into \hat{\boldsymbol{Y}}, as if putting a hat (i.e., \hat{}) on \boldsymbol{Y}. Let \boldsymbol{m}_i be the i-th column vector of \boldsymbol{M}, which is the i-th sample here. We consider: \begin{equation} \boldsymbol{H}_{i,i} = \boldsymbol{m}_i^{\top}(\boldsymbol{M}\boldsymbol{M}^{\top})^{-1}\boldsymbol{m}_i \end{equation} to measure the role of this sample in predicting \hat{\boldsymbol{Y}}. This is the "Leverage Score." We believe that selecting r key columns is equivalent to selecting the r most important samples, so we can choose the r columns with the largest \boldsymbol{H}_{i,i}.
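As a quick numerical sanity check of this hat-matrix view, here is a minimal sketch with a random matrix (assuming n < m and full row rank, as in the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 50
M = rng.standard_normal((n, m))                # full row rank with probability 1

H = M.T @ np.linalg.inv(M @ M.T) @ M           # hat matrix
print(np.allclose(H, np.linalg.pinv(M) @ M))   # coincides with M^+ M in the full-rank case
print(np.diag(H)[:5])                          # leverage scores of the first five columns
```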
When \boldsymbol{M}\boldsymbol{M}^{\top} is not invertible, the paper "Input Sparsity Time Low-Rank Approximation via Ridge Leverage Score Sampling" generalizes leverage scores to "Ridge Leverage Scores," which amounts to adding a regularization term to objective \eqref{eq:linear-loss} so that the matrix to be inverted becomes nonsingular. In fact, however, the pseudo-inverse does not require full rank and can be computed directly through SVD, which removes the need for an extra regularization term.
Let the SVD of \boldsymbol{M} be \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}, then: \begin{equation} \boldsymbol{H} = \boldsymbol{M}^{\dagger}\boldsymbol{M} = (\boldsymbol{V}\boldsymbol{\Sigma}^{\dagger}\boldsymbol{U}^{\top})(\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}) = \boldsymbol{V}(\boldsymbol{\Sigma}^{\dagger}\boldsymbol{\Sigma})\boldsymbol{V}^{\top} \end{equation} Assuming the rank of \boldsymbol{M} is \gamma (which in general differs from r), then by the calculation rules of the pseudo-inverse, \boldsymbol{\Sigma}^{\dagger}\boldsymbol{\Sigma} is an m\times m diagonal matrix whose first \gamma diagonal elements are 1 and the rest are 0. Therefore: \begin{equation} \boldsymbol{H} = \boldsymbol{V}_{[:,:\gamma]}\boldsymbol{V}_{[:,:\gamma]}^{\top}\quad\Rightarrow\quad \boldsymbol{H}_{i,i} = \Vert\boldsymbol{V}_{[i-1,:\gamma]}\Vert^2 \end{equation} Note that \boldsymbol{H}_{i,i} denotes the element in the i-th row and i-th column of \boldsymbol{H}, counting from 1, while the slicing follows Python conventions, counting from 0, hence the final slice {}_{[i-1,:\gamma]}. We now see that to compute the leverage scores of the columns of \boldsymbol{M}, we only need the SVD \boldsymbol{M}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top} and the squared norm of each row of \boldsymbol{V}_{[:,:\gamma]}; similarly, the leverage scores of the rows are the squared norms of the rows of \boldsymbol{U}_{[:,:\gamma]}.
Since \boldsymbol{V} itself is an orthogonal matrix, it always holds that: \begin{equation} \sum_{i=1}^m \boldsymbol{H}_{i,i} = \sum_{i=1}^m\Vert\boldsymbol{V}_{[i-1,:\gamma]}\Vert^2 = \gamma \end{equation} Therefore, in addition to selecting the r columns with the largest leverage scores, one can also construct a distribution p_i = \boldsymbol{H}_{i,i} / \gamma for random sampling.
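Putting these observations together, here is a minimal numpy sketch for computing column leverage scores and using them either for top-r selection or for sampling (the function name and the tolerance used to pick \gamma are our own choices):

```python
import numpy as np

def column_leverage_scores(M, gamma=None, tol=1e-10):
    """Leverage scores of the columns of M: squared row norms of V[:, :gamma]."""
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    if gamma is None:
        gamma = int(np.sum(s > tol * s[0]))     # number of significant singular values
    return np.sum(Vt[:gamma, :] ** 2, axis=0), gamma

rng = np.random.default_rng(0)
M, r = rng.standard_normal((100, 12)) @ rng.standard_normal((12, 80)), 10
scores, gamma = column_leverage_scores(M)                 # here gamma = 12
top_r = np.argsort(-scores)[:r]                           # r columns with the largest scores
sampled = rng.choice(M.shape[1], size=r, replace=False,
                     p=scores / gamma)                    # or sample from p_i = H_ii / gamma
```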
Leverage scores depend on the rank \gamma of \boldsymbol{M}, and the rank of \boldsymbol{M} equals the number of non-zero singular values, so they are sensitive to singular values close to zero. Since these tiny singular values contribute little, in practice \gamma is usually taken to be the number of significant (principal) singular values of \boldsymbol{M}. Another issue with leverage scores is that they require an SVD up front; in practice, approximate SVD algorithms are usually used. We will discuss approximate SVD algorithms when the opportunity arises.
DEIM Method
Another column selection method to be introduced is DEIM, short for Discrete Empirical Interpolation Method. We won't delve into the origin of the name here; what matters is that both leverage scores and DEIM are commonly used column selection methods for CUR decomposition, and DEIM, being more closely tied to CUR decomposition itself, has become increasingly popular in recent years.
The starting point of DEIM is equation \eqref{eq:id-abcd}. Under these \boldsymbol{\mathcal{C}}, \boldsymbol{\mathcal{U}}, \boldsymbol{\mathcal{R}}, the error of the CUR decomposition depends on \Vert \boldsymbol{D} - \boldsymbol{C}\boldsymbol{A}^{-1}\boldsymbol{B}\Vert_F. When is this quantity small? An intuitive answer: when \boldsymbol{A} is relatively large and \boldsymbol{B}, \boldsymbol{C}, \boldsymbol{D} are all relatively small, the whole expression will certainly be small. But \boldsymbol{A} is a matrix; how do we measure its size? The absolute value of its determinant is one reasonable indicator. A feasible scheme, then, is to choose the rows and columns that maximize the absolute value of the determinant of \boldsymbol{M}_{[S_2,S_1]}.
Of course, this scheme has mainly theoretical value, because finding the submatrix with the maximum absolute determinant is itself NP-hard. However, it gives us a target, and we can look for a greedy solution. When r=1, maximizing the absolute value of the determinant simply means finding the entry with the largest absolute value, which is easy, and we can then proceed recursively. DEIM follows this idea, but instead of working on \boldsymbol{M} directly, it borrows the approach of leverage scores and starts from the \boldsymbol{V} of the SVD.
Leverage scores transform finding key columns of \boldsymbol{M} into finding key rows of \boldsymbol{V}_{[:,:\gamma]}, ranked by squared row norm. DEIM, on the other hand, finds key rows by constructing a CUR approximation of \boldsymbol{V} itself. But if the CUR of \boldsymbol{M} isn't solved yet and we now need a CUR of \boldsymbol{V}, aren't we making things more complicated? No, the problem is a bit simpler here: \boldsymbol{V} comes from the SVD of \boldsymbol{M}, so its columns are already ordered by singular value, and we may treat them as sorted by importance. The r most important columns are therefore simply \boldsymbol{V}_{[:,:r]}, and only the rows remain to be selected.
As mentioned before, the solution idea is a greedy algorithm. The most important column is naturally the first column \boldsymbol{V}_{[:,0]}. What about the most important row? We choose the row whose entry in this first column has the largest absolute value; this gives us the initial row. Now suppose we have already selected k key rows, with index set S_k. How do we select the (k+1)-th key row? The CUR approximation constructed from the selected k rows and the first k columns is \boldsymbol{V}_{[:,:k]}\boldsymbol{V}_{[S_k,:k]}^{-1}\boldsymbol{V}_{[S_k,:]}, and its error on the (k+1)-th column is: \begin{equation} \boldsymbol{V}_{[:,k]} - \boldsymbol{V}_{[:,:k]}\boldsymbol{V}_{[S_k,:k]}^{-1}\boldsymbol{V}_{[S_k,k]} \end{equation} From equation \eqref{eq:id-abcd}, we know that this type of CUR approximation exactly recovers the selected rows, so the components of this residual on the k already selected rows are zero. The entry with the largest absolute value in the residual therefore lies outside the selected rows, and we take its row as the (k+1)-th key row.
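A minimal sketch of this greedy procedure (the function name deim_select is ours, and the code is a direct transcription of the recursion above rather than an optimized implementation):

```python
import numpy as np

def deim_select(V, r):
    """Greedily pick r row indices from the first r columns of V (DEIM)."""
    S = [int(np.argmax(np.abs(V[:, 0])))]        # row with the largest |entry| of column 0
    for k in range(1, r):
        # residual of column k w.r.t. the CUR approximation built from rows S and columns :k
        coeff = np.linalg.solve(V[S, :k], V[S, k])
        residual = V[:, k] - V[:, :k] @ coeff
        S.append(int(np.argmax(np.abs(residual))))  # zero on rows in S, so a new row is picked
    return np.array(S)

# usage sketch: DEIM on V gives column indices of M, DEIM on U gives row indices
rng = np.random.default_rng(0)
M, r = rng.standard_normal((100, 12)) @ rng.standard_normal((12, 80)), 10
U, s, Vt = np.linalg.svd(M, full_matrices=False)
S1, S2 = deim_select(Vt.T, r), deim_select(U, r)
C, R = M[:, S1], M[S2, :]
mid = np.linalg.pinv(M[np.ix_(S2, S1)])          # intersection-submatrix choice of U
print(np.linalg.norm(C @ mid @ R - M) / np.linalg.norm(M))
```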
In short, DEIM utilizes the fact that SVD has already sorted the column vectors of \boldsymbol{V} to transform the CUR decomposition into a pure row search problem, reducing the search directions, and then solves it through a greedy algorithm. The basis for selection at each step is the row where the error element is largest. For a more detailed introduction and proof, please refer to "A DEIM Induced CUR Factorization" and "CUR Matrix Factorizations: Algorithms, Analysis, Applications".
Summary
This article introduced CUR decomposition, which can be seen as a further extension of the Interpolative Decomposition (ID) introduced in the previous article. Its characteristic is that it simultaneously uses several rows and columns of the original matrix as a skeleton to construct a low-rank approximation.
Reprinting notice: Please include the original address of this article when reprinting: https://kexue.fm/archives/10662
For more details on reprinting, please refer to: "Scientific Space FAQ"