The protagonist of this article is ID (Interpolative Decomposition). It can be understood as a low-rank decomposition with a specific structure, where one side consists of several columns of the original matrix (of course, if you prefer rows, selecting rows is also fine). In other words, ID attempts to find several key columns from a matrix to serve as a "skeleton" (usually also called a "sketch") to approximate the original matrix.
Many readers may never have heard of ID; even Wikipedia offers only a few vague sentences about it (link). Yet ID has long been built into SciPy, just like SVD (see scipy.linalg.interpolative), which indirectly attests to its practical value.
Basic Definition
In the first three articles, we introduced the pseudo-inverse, SVD, and CR approximation. All of these can be seen as seeking a low-rank approximation with a specific structure: \begin{equation} \mathop{\text{argmin}}_{\text{rank}(\tilde{\boldsymbol{M}})\leq r}\Vert \tilde{\boldsymbol{M}} - \boldsymbol{M}\Vert_F^2 \end{equation} where \boldsymbol{M}\in\mathbb{R}^{n\times m}. When no other constraints are added, the optimal solution is given by SVD. When we stipulate \tilde{\boldsymbol{M}}=\boldsymbol{A}\boldsymbol{B} and one of \boldsymbol{A} or \boldsymbol{B} is already given, the optimal solution for the other half can be provided by the pseudo-inverse. If we stipulate \boldsymbol{M}=\boldsymbol{X}\boldsymbol{Y} and \tilde{\boldsymbol{M}}=\boldsymbol{X}_{[:, S]}\boldsymbol{Y}_{[S,:]}, we get the problem addressed by CR approximation.
CR approximation constructs a low-rank approximation by selecting rows/columns of the original matrix, which makes the approximation results more interpretable and also applicable to some non-linear scenarios. However, the premise of CR approximation is that the matrix \boldsymbol{M} itself is formed by the multiplication of two matrices; its original intention is to reduce the amount of matrix computation. For scenarios where the matrix \boldsymbol{M} is given directly, a similar low-rank approximation is provided by ID.
Specifically, in ID, we have \tilde{\boldsymbol{M}}=\boldsymbol{C}\boldsymbol{Z}, where \boldsymbol{C}=\boldsymbol{M}_{[:,S]} consists of several columns of \boldsymbol{M}, and \boldsymbol{Z} is arbitrary. That is, it uses several columns of \boldsymbol{M} as a skeleton to approximate itself: \begin{equation} \mathop{\text{argmin}}_{S,\boldsymbol{Z}}\Vert \underbrace{\boldsymbol{M}_{[:,S]}}_{\boldsymbol{C}}\boldsymbol{Z} - \boldsymbol{M}\Vert_F^2\quad\text{s.t.}\quad S\subset\{0,1,\cdots,m-1\},|S|=r,\boldsymbol{Z}\in\mathbb{R}^{r\times m} \end{equation} According to the results in "The Road to Low-Rank Approximation (I): Pseudo-inverse", if \boldsymbol{C} is already determined, the optimal solution for \boldsymbol{Z} is \boldsymbol{C}^{\dagger} \boldsymbol{M}. Therefore, the actual difficulty of ID lies only in the optimization of S, i.e., the selection of columns. This is a combinatorial optimization problem, and solving it exactly is NP-Hard. Thus, the main goal is to find approximation algorithms with appropriate efficiency and accuracy.
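To make the structure concrete, here is a minimal NumPy sketch (the test matrix, the rank, and the index set S are made up for illustration): it fixes some column subset S, forms \boldsymbol{C}=\boldsymbol{M}_{[:,S]}, and computes the optimal \boldsymbol{Z}=\boldsymbol{C}^{\dagger}\boldsymbol{M} for that particular S.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 100, 60, 10
# A synthetic matrix that is approximately rank r
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
M += 0.01 * rng.standard_normal((n, m))

S = rng.choice(m, size=r, replace=False)   # some fixed column subset S
C = M[:, S]                                # skeleton C = M[:, S]
Z = np.linalg.pinv(C) @ M                  # optimal Z for this S is C^+ M
err = np.linalg.norm(C @ Z - M, 'fro') ** 2
print(S, err)
```

The hard part of ID is precisely choosing S so that this error becomes as small as possible.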
Geometric Meaning
Before attempting to solve it, let’s further understand the geometric meaning of ID, which helps us better understand its application scenarios and solution ideas. We first represent \boldsymbol{C} in column vector form \boldsymbol{C}=(\boldsymbol{c}_1,\boldsymbol{c}_2,\cdots,\boldsymbol{c}_r). Then, for any column vector \boldsymbol{z}=(z_1,z_2,\cdots,z_r)^{\top}, we have: \begin{equation} \boldsymbol{C}\boldsymbol{z} = \begin{pmatrix}\boldsymbol{c}_1 & \boldsymbol{c}_2 & \cdots & \boldsymbol{c}_r\end{pmatrix}\begin{pmatrix}z_1 \\ z_2 \\ \vdots \\ z_r\end{pmatrix} = \sum_{i=1}^r z_i \boldsymbol{c}_i \end{equation} So the geometric meaning of \boldsymbol{C}\boldsymbol{z} is a linear combination of the column vectors of \boldsymbol{C}. Note that \boldsymbol{c}_1,\boldsymbol{c}_2,\cdots,\boldsymbol{c}_r are selected from \boldsymbol{M}=(\boldsymbol{m}_1,\boldsymbol{m}_2,\cdots,\boldsymbol{m}_m). Thus, ID means selecting several columns as (approximate) basis vectors and representing the remaining columns as linear combinations of these bases. This is the meaning of "Interpolative" in ID.
Strictly speaking, "interpolative" refers to interpolation. To better highlight this interpolation characteristic, some literature adds the condition |z_{i,j}| \leq 1 to the definition of ID (where z_{i,j} is any element of the matrix \boldsymbol{Z}). This condition is rather strict, and enforcing it exactly may itself be NP-Hard, so much of the literature relaxes it to |z_{i,j}| \leq 2, a bound that most approximation algorithms satisfy in practice. If nothing else is required and only the optimality of the approximation error matters, this restriction can also be dropped.
QR Decomposition
Solving algorithms for ID are divided into two categories: deterministic algorithms and randomized algorithms. Deterministic algorithms involve more computation but often provide better approximations; conversely, randomized algorithms are more computationally efficient but slightly less accurate. Note that these are only approximation algorithms with acceptable practical performance, and they do not rule out the possibility of extreme cases where they might fail completely.
The first algorithm, generally regarded as the standard choice, is based on QR decomposition, more specifically Column-Pivoting QR decomposition. Why is ID linked to QR decomposition? We can understand this from the perspective of how to find \boldsymbol{Z}.
As mentioned earlier, if \boldsymbol{C} is given, the optimal solution for \boldsymbol{Z} is \boldsymbol{C}^{\dagger}\boldsymbol{M}. This answer is correct but not intuitive. Without loss of generality, assume \boldsymbol{c}_1,\boldsymbol{c}_2,\cdots,\boldsymbol{c}_r are linearly independent. From a geometric perspective, finding the optimal approximation in the form \boldsymbol{C}\boldsymbol{Z} is actually projecting each column vector of \boldsymbol{M} onto the r-dimensional subspace formed by \boldsymbol{c}_1,\boldsymbol{c}_2,\cdots,\boldsymbol{c}_r. To find this projection, we can first perform Gram-Schmidt orthogonalization on \boldsymbol{c}_1,\boldsymbol{c}_2,\cdots,\boldsymbol{c}_r to turn them into a set of orthonormal bases. Projecting onto orthonormal bases is much simpler, and the process of orthogonalization naturally corresponds to QR decomposition.
Gram-Schmidt orthogonalization recursively executes the following steps: \begin{equation} \boldsymbol{q}_1 = \frac{\boldsymbol{c}_1}{\Vert\boldsymbol{c}_1\Vert},\quad \boldsymbol{q}_k = \frac{\hat{\boldsymbol{q}}_k}{\Vert\hat{\boldsymbol{q}}_k\Vert},\quad\hat{\boldsymbol{q}}_k = \boldsymbol{c}_k - \sum_{i=1}^{k-1} (\boldsymbol{c}_k^{\top} \boldsymbol{q}_i)\boldsymbol{q}_i,\quad k = 2,3,\cdots,r \end{equation} The result represents \boldsymbol{C} as: \begin{equation} \boldsymbol{C} = \underbrace{\begin{pmatrix}\boldsymbol{q}_1 & \boldsymbol{q}_2 & \cdots & \boldsymbol{q}_r\end{pmatrix}}_{\boldsymbol{Q}}\underbrace{\begin{pmatrix}R_{1,1} & R_{1,2} & \cdots & R_{1,r} \\ 0 & R_{2,2} & \cdots & R_{2,r} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_{r,r} \\ \end{pmatrix}}_{\boldsymbol{R}} \end{equation} With \boldsymbol{q}_1,\boldsymbol{q}_2,\cdots,\boldsymbol{q}_r, the optimal approximation and error of the k-th column \boldsymbol{m}_k of matrix \boldsymbol{M} on \boldsymbol{C} are respectively: \begin{equation} \sum_{i=1}^r (\boldsymbol{m}_k^{\top} \boldsymbol{q}_i)\boldsymbol{q}_i\qquad\text{and}\qquad \left\Vert\boldsymbol{m}_k - \sum_{i=1}^r (\boldsymbol{m}_k^{\top} \boldsymbol{q}_i)\boldsymbol{q}_i\right\Vert^2 \end{equation}
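As a small illustration (my own, not from the original post), this projection takes only a few lines of NumPy: orthonormalize the columns of \boldsymbol{C} (np.linalg.qr plays the role of Gram-Schmidt here), project every column of \boldsymbol{M} onto the resulting basis, and read off both the optimal \boldsymbol{Z} and the per-column errors.

```python
import numpy as np

def project_onto_skeleton(M, C):
    """Given a full-column-rank skeleton C, return the optimal Z (= C^+ M)
    and the squared projection error of every column of M."""
    Q, R = np.linalg.qr(C)                 # C = Q R, Q has orthonormal columns q_i
    coeff = Q.T @ M                        # coefficients m_k^T q_i
    approx = Q @ coeff                     # best approximation of each column of M
    errors = np.linalg.norm(M - approx, axis=0) ** 2
    Z = np.linalg.solve(R, coeff)          # Z = R^{-1} Q^T M = C^+ M
    return Z, errors
```

Since \boldsymbol{C}=\boldsymbol{Q}\boldsymbol{R} with invertible \boldsymbol{R}, we have \boldsymbol{C}^{\dagger}\boldsymbol{M}=\boldsymbol{R}^{-1}\boldsymbol{Q}^{\top}\boldsymbol{M}, which recovers the pseudo-inverse formula mentioned earlier.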
Column-Pivoting QR
Of course, the above results are obtained under the premise that \boldsymbol{C} is known. How do we pick a relatively good set of r columns from \boldsymbol{M} to form \boldsymbol{C}? Column-Pivoting QR decomposition provides a reference answer.
Generally, if we were to perform Gram-Schmidt orthogonalization on \boldsymbol{m}_1,\boldsymbol{m}_2,\cdots,\boldsymbol{m}_m, we would do it in order, starting from \boldsymbol{m}_1, followed by \boldsymbol{m}_2, \boldsymbol{m}_3, \cdots. Column-Pivoting QR decomposition modifies the orthogonalization order based on the norm. Written as a formula: \begin{equation} \begin{gathered} \boldsymbol{q}_1 = \frac{\boldsymbol{m}_{\rho_1}}{\Vert\boldsymbol{m}_{\rho_1}\Vert},\quad \boldsymbol{q}_k = \frac{\hat{\boldsymbol{q}}_k}{\Vert\hat{\boldsymbol{q}}_k\Vert},\quad\hat{\boldsymbol{q}}_k = \boldsymbol{m}_{\rho_k} - \sum_{i=1}^{k-1} (\boldsymbol{m}_{\rho_k}^{\top} \boldsymbol{q}_i)\boldsymbol{q}_i \\ \rho_1 = \mathop{\text{argmax}}_{i\in\{1,2,\cdots,m\}} \Vert \boldsymbol{m}_i\Vert,\quad \rho_k = \mathop{\text{argmax}}_{i\in\{1,2,\cdots,m\}\backslash\{\rho_1,\rho_2,\cdots,\rho_{k-1}\}} \left\Vert \boldsymbol{m}_i - \sum_{j=1}^{k-1} (\boldsymbol{m}_i^{\top} \boldsymbol{q}_j)\boldsymbol{q}_j\right\Vert \end{gathered} \end{equation} Simply put, Column-Pivoting QR decomposition selects, at each step, the column with the largest remaining error and orthonormalizes it. Apart from the change in processing order, the computation is no different from ordinary QR decomposition, so the final result can be written as: \begin{equation} \boldsymbol{M}\boldsymbol{\Pi} = \underbrace{\begin{pmatrix}\boldsymbol{q}_1 & \boldsymbol{q}_2 & \cdots & \boldsymbol{q}_m\end{pmatrix}}_{\boldsymbol{Q}}\underbrace{\begin{pmatrix}R_{1,1} & R_{1,2} & \cdots & R_{1,m} \\ 0 & R_{2,2} & \cdots & R_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_{m,m} \\ \end{pmatrix}}_{\boldsymbol{R}} \end{equation} where \boldsymbol{\Pi} is a column permutation matrix.
Because each step picks the column with the largest remaining error (norm), for any k the first column of the submatrix \boldsymbol{R}_{[k-1:,k-1:]} has the largest norm, i.e. its norm is not less than that of any remaining column: \begin{equation} R_{k,k}^2 \geq \sum_{i=k}^j R_{i,j}^2,\quad \forall j = k,k+1,\cdots,m \end{equation} From this it also follows that |R_{1,1}|\geq |R_{2,2}| \geq \cdots\geq |R_{m,m}|. These properties lead us to believe that if we want a rank-r approximation of \boldsymbol{M}\boldsymbol{\Pi}, keeping only the first r rows of \boldsymbol{R} should be a good choice: \begin{equation} \boldsymbol{M}\boldsymbol{\Pi} = \boldsymbol{Q}\boldsymbol{R} \approx \boldsymbol{Q}_{[:,:r]}\boldsymbol{R}_{[:r,:]}=\boldsymbol{Q}_{[:,:r]}\big[\boldsymbol{R}_{[:r,:r]},\boldsymbol{R}_{[:r,r:]}\big]=\boldsymbol{Q}_{[:,:r]}\boldsymbol{R}_{[:r,:r]}\big[\boldsymbol{I}_r,\boldsymbol{R}_{[:r,:r]}^{-1}\boldsymbol{R}_{[:r,r:]}\big] \end{equation} Note that we previously agreed that slicing has higher priority than inversion, so \boldsymbol{R}_{[:r,:r]}^{-1} here means (\boldsymbol{R}_{[:r,:r]})^{-1}. It is not hard to see that \boldsymbol{Q}_{[:,:r]}\boldsymbol{R}_{[:r,:r]} consists of exactly r columns of the matrix \boldsymbol{M}, so the formula above in fact gives an ID approximation: \begin{equation} \boldsymbol{M} \approx \boldsymbol{C}\boldsymbol{Z},\quad \boldsymbol{C}=\boldsymbol{Q}_{[:,:r]}\boldsymbol{R}_{[:r,:r]},\quad \boldsymbol{Z}=\big[\boldsymbol{I}_r,\boldsymbol{R}_{[:r,:r]}^{-1}\boldsymbol{R}_{[:r,r:]}\big]\boldsymbol{\Pi}^{\top} \end{equation}
The above is the ID solving algorithm based on Column-Pivoting QR decomposition, which is also the solving algorithm built into SciPy (rand=False). Note that this algorithm cannot guarantee |z_{i,j}| \leq 1 or |z_{i,j}| \leq 2, but according to reports in much of the literature, |z_{i,j}| > 2 almost never occurs in practice, so it is regarded as a fairly good solver. Additionally, SciPy also has a built-in Column-Pivoting QR decomposition: setting pivoting=True in scipy.linalg.qr enables it.
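As an illustrative sketch (not the SciPy source), the formula above translates almost line by line into code via scipy.linalg.qr with pivoting=True; the function name id_via_pivoted_qr is made up here.

```python
import numpy as np
from scipy.linalg import qr

def id_via_pivoted_qr(M, r):
    """Rank-r ID of M built from a column-pivoting QR, so that M ≈ C @ Z."""
    Q, R, perm = qr(M, mode='economic', pivoting=True)  # M[:, perm] = Q @ R
    S = perm[:r]                                # indices of the selected columns
    C = M[:, S]                                 # skeleton, equals Q[:, :r] @ R[:r, :r]
    T = np.linalg.solve(R[:r, :r], R[:r, r:])   # R_{[:r,:r]}^{-1} R_{[:r,r:]}
    Z = np.empty((r, M.shape[1]))
    Z[:, perm[:r]] = np.eye(r)                  # [I_r, T] followed by undoing the permutation
    Z[:, perm[r:]] = T
    return S, C, Z
```

Up to implementation details, the result should agree with what scipy.linalg.interpolative.interp_decomp(M, r, rand=False) returns.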
Randomized Solutions
Each orthogonalization step of Column-Pivoting QR decomposition requires scanning all remaining vectors to find the one with the maximum error, which is often unacceptable when m is very large. On the other hand, if n is very large, computing the norms and inner products also becomes very expensive. Randomized algorithms therefore emerged; they aim to reduce n or m in order to lower the computational cost.
First, let’s look at the idea of reducing n, i.e., reducing the dimension of each column vector of \boldsymbol{M}. A common method is random projection, which is exactly the same as the "JL Lemma" introduced in "The Amazing Johnson-Lindenstrauss Lemma: Theory". Specifically, assume \boldsymbol{\Omega}\in\mathbb{R}^{d\times n} is some random projection matrix (where d\ll n), whose elements are independently and identically sampled from a distribution such as \mathcal{N}(0,1/n). Then we consider performing Column-Pivoting QR decomposition on the small matrix \boldsymbol{\Omega}\boldsymbol{M}\in\mathbb{R}^{d\times m} to determine the positions of the selected r columns. For a more detailed introduction, refer to "Randomized algorithms for pivoting and for computing interpolatory and CUR factorizations".
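A minimal sketch of this idea (my own illustration, using a plain Gaussian \boldsymbol{\Omega} rather than anything more structured): run the column selection on the small matrix \boldsymbol{\Omega}\boldsymbol{M}, then solve for \boldsymbol{Z} on the full matrix.

```python
import numpy as np
from scipy.linalg import qr

def rid_via_random_projection(M, r, d, rng=None):
    """Randomized ID: pick columns with pivoted QR on the sketch Omega @ M (d << n)."""
    rng = np.random.default_rng() if rng is None else rng
    n = M.shape[0]
    Omega = rng.standard_normal((d, n)) / np.sqrt(n)       # Gaussian random projection
    _, _, perm = qr(Omega @ M, mode='economic', pivoting=True)
    S = perm[:r]                      # columns chosen from the sketched matrix
    C = M[:, S]
    Z = np.linalg.pinv(C) @ M         # optimal Z given the chosen columns
    return S, C, Z
```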
According to my limited research, SciPy’s randomized algorithm for solving ID also uses a similar idea, but replaces the randomly sampled matrix with a more structured "Subsampled Randomized Fourier Transform (SRFT)," allowing the computation of \boldsymbol{\Omega}\boldsymbol{M} to be reduced from \mathcal{O}(mnd) to \mathcal{O}(mn\log d). However, I am not familiar with the implementation details of SRFT and SciPy. Interested readers can refer to materials like "Enabling very large-scale matrix computations via randomization" and "A brief introduction to Randomized Linear Algebra" for further investigation.
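For reference, here is a minimal usage sketch of SciPy's built-in solver, to the best of my understanding of its API: interp_decomp with rand=True takes the randomized path, rand=False the deterministic one.

```python
import numpy as np
import scipy.linalg.interpolative as sli

rng = np.random.default_rng(0)
M = rng.standard_normal((1000, 30)) @ rng.standard_normal((30, 400))  # test matrix
r = 30

idx, proj = sli.interp_decomp(M, r, rand=True)   # randomized rank-r ID
C = sli.reconstruct_skel_matrix(M, r, idx)       # C = M[:, idx[:r]]
Z = sli.reconstruct_interp_matrix(idx, proj)     # Z such that M ≈ C @ Z
print(np.linalg.norm(C @ Z - M) / np.linalg.norm(M))
```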
Another reason for not delving into complex random projection methods like SRFT is that the paper "Efficient Algorithms for Constructing an Interpolative Decomposition" found that simpler column sampling often yields better results and is particularly easy to understand. It involves randomly sampling k > r columns from \boldsymbol{M}, then using Column-Pivoting QR decomposition to select r columns from these k columns to serve as \boldsymbol{C}, and finally solving for \boldsymbol{Z} based on \boldsymbol{C}. This reduces the matrix size for Column-Pivoting QR decomposition from n\times m to n\times k.
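Under the same caveat (a sketch of my own, not the paper's reference code), the column-sampling variant looks like this:

```python
import numpy as np
from scipy.linalg import qr

def rid_via_column_sampling(M, r, k, rng=None):
    """Randomized ID: uniformly sample k > r candidate columns, pick r of them
    with pivoted QR on the n x k submatrix, then solve for Z on the full M."""
    rng = np.random.default_rng() if rng is None else rng
    cols = rng.choice(M.shape[1], size=k, replace=False)   # k candidate columns
    _, _, perm = qr(M[:, cols], mode='economic', pivoting=True)
    S = cols[perm[:r]]                 # the r selected column indices of M
    C = M[:, S]
    Z = np.linalg.pinv(C) @ M          # optimal Z for the chosen skeleton
    return S, C, Z
```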
Experiments show that such a simple idea at most increases the risk of |z_{i,j}| > 2 in individual tasks, but has a clear advantage in error:
Improving Accuracy
From the above tables, one might notice a perhaps surprising result: Optim-RID, which uses randomized column sampling, is not only superior in error to SciPy-RID (also a randomized algorithm), but in some tasks, it even outperforms the deterministic algorithms SciPy-ID and Optim-ID (which are mathematically equivalent, both based on full Column-Pivoting QR decomposition, differing only in implementation efficiency).
This seemingly counter-intuitive phenomenon actually illustrates a fact: although Column-Pivoting QR decomposition can serve as a good baseline for ID, its ability to select bases might not be much better than random selection. The main role of Column-Pivoting QR decomposition is just to guarantee |z_{i,j}| < 2 with high probability. In fact, this is not hard to understand. Let’s take r=1 as an example. In this case, Column-Pivoting QR decomposition returns the column with the largest norm. But is the column with the largest norm necessarily a good basis (one that minimizes reconstruction error)? Obviously not. A good basis should be the direction that most vectors point towards; the maximum norm does not reflect this.
For ID, Column-Pivoting QR decomposition is essentially a greedy algorithm that greedily transforms the selection of r columns into multiple recursive selections of 1 column. When r=1, in scenarios where m is not too large or high precision is required, the complexity of exact solving through enumeration is acceptable: \begin{equation} \mathop{\text{argmin}}_i \sum_{j=1}^m \left\Vert\boldsymbol{m}_j - \frac{(\boldsymbol{m}_j^{\top} \boldsymbol{m}_i)\boldsymbol{m}_i}{\Vert\boldsymbol{m}_i\Vert^2}\right\Vert^2 \end{equation} That is, traverse all \boldsymbol{m}_i, project all remaining columns onto \boldsymbol{m}_i to calculate the total error, and select the \boldsymbol{m}_i with the minimum total error. Its complexity is proportional to m^2. If the operation of selecting the maximum norm in each step of Column-Pivoting QR decomposition is replaced by selecting the one with the minimum total error as shown above, then better bases can be found, thereby achieving lower reconstruction error (the cost, naturally, is higher complexity, and it becomes even harder to guarantee |z_{i,j}| < 2).
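For completeness, here is a small vectorized sketch of the r=1 enumeration (assuming no column of \boldsymbol{M} is zero):

```python
import numpy as np

def best_single_column(M):
    """Exact r=1 ID by enumeration: project all columns onto each candidate
    m_i and return the index i with the smallest total squared error."""
    G = M.T @ M                          # Gram matrix, G[i, j] = m_i^T m_j
    sq_norms = np.diag(G)                # ||m_i||^2, assumed nonzero
    total_err = sq_norms.sum() - (G ** 2).sum(axis=1) / sq_norms
    return int(np.argmin(total_err))
```

Forming the Gram matrix costs O(m^2 n), consistent with the "proportional to m^2" count of projections above.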
In general, due to the NP-Hard nature of exact solving, there are many approaches to solving ID. The ones listed above are just a few limited examples. Interested readers can search further using keywords like Randomized Linear Algebra and Column Subset Selection. It is particularly worth pointing out that Randomized Linear Algebra, which aims to accelerate various matrix operations through stochastic methods, has itself become a rich discipline. The randomized ID in this article and the sampling-based CR approximation in the previous one are both classic examples of this field.
Summary
This article introduced ID (Interpolative Decomposition), which approximates the original matrix by selecting several of its columns as a "skeleton." It is a low-rank decomposition with a specific structure, and its geometric meaning is relatively intuitive. Its core difficulty lies in the selection of columns, which is essentially an NP-Hard discrete optimization problem.
Reprinting: Please include the original address of this article: https://kexue.fm/archives/10501