English (unofficial) translations of posts at kexue.fm

The Road to Low-Rank Approximation (III): CR

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate; please refer to the original post for anything important.

In "The Road to Low-Rank Approximation (II): SVD", we proved that Singular Value Decomposition (SVD) provides the optimal low-rank approximation for any matrix. The optimal approximation there was unconstrained, meaning the SVD result only cares about minimizing the error and disregards the specific structure of the matrix. However, in many application scenarios, due to requirements such as interpretability or non-linear processing, we often hope to obtain an approximate decomposition with certain special structures.

Therefore, starting from this article, we will explore low-rank approximations with specific structures. This article will focus on the CR Approximation (Column-Row Approximation), which provides a simple scheme for accelerating matrix multiplication operations.

Problem Background

The general formulation of the optimal rank-r approximation of a matrix is: \begin{equation} \mathop{\text{argmin}}_{\text{rank}(\tilde{\boldsymbol{M}})\leq r}\Vert \tilde{\boldsymbol{M}} - \boldsymbol{M}\Vert_F^2 \label{eq:loss-m2} \end{equation} where \boldsymbol{M}, \tilde{\boldsymbol{M}} \in \mathbb{R}^{n\times m} and r < \min(n,m). In the previous two articles, we discussed two cases:

1. If \tilde{\boldsymbol{M}} has no other constraints, the optimal solution for \tilde{\boldsymbol{M}} is \boldsymbol{U}_{[:,:r]}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{V}_{[:,:r]}^{\top}, where \boldsymbol{M}=\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top} is the Singular Value Decomposition (SVD) of \boldsymbol{M}.

2. If we specify \tilde{\boldsymbol{M}}=\boldsymbol{A}\boldsymbol{B} (\boldsymbol{A}\in\mathbb{R}^{n\times r}, \boldsymbol{B}\in\mathbb{R}^{r\times m}), and \boldsymbol{A} (or \boldsymbol{B}) is already given, then the optimal solution for \tilde{\boldsymbol{M}} is \boldsymbol{A} \boldsymbol{A}^{\dagger} \boldsymbol{M} (or \boldsymbol{M} \boldsymbol{B}^{\dagger} \boldsymbol{B}), where {}^\dagger denotes the "pseudo-inverse".

Both results have wide applications, but neither imposes an explicit structural connection between \tilde{\boldsymbol{M}} and \boldsymbol{M}, which makes their relationship hard to see intuitively; in other words, \tilde{\boldsymbol{M}} lacks strong interpretability.

Furthermore, if the objective involves non-linear operations such as \phi(\boldsymbol{X}\boldsymbol{W}), we are usually not allowed to use arbitrary real projection matrices for dimensionality reduction. Instead, we are required to use "Selection Matrices." For example, \phi(\boldsymbol{X}\boldsymbol{W})\boldsymbol{S} = \phi(\boldsymbol{X}\boldsymbol{W}\boldsymbol{S}) does not hold for an arbitrary matrix \boldsymbol{S}, but it holds identically for a selection matrix \boldsymbol{S}.
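To make this concrete, here is a minimal NumPy sketch (not from the original post; the shapes, the ReLU nonlinearity, and the index choice are arbitrary assumptions for illustration). A selection matrix merely picks columns, and picking columns commutes with an elementwise nonlinearity, whereas a general matrix does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 4, 6, 3
X, W = rng.normal(size=(n, d)), rng.normal(size=(d, d))
relu = lambda t: np.maximum(t, 0)

# Selection matrix: each column has exactly one 1, so right-multiplying by it just picks columns.
idx = np.array([0, 2, 5])
S = np.zeros((d, r))
S[idx, np.arange(r)] = 1.0

print(np.allclose(relu(X @ W) @ S, relu(X @ W @ S)))   # True: selection commutes with phi
P = rng.normal(size=(d, r))
print(np.allclose(relu(X @ W) @ P, relu(X @ W @ P)))   # False in general for an arbitrary P
```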

Therefore, we next focus on low-rank approximations under selection matrix constraints. Specifically, given \boldsymbol{X}\in\mathbb{R}^{n\times l} and \boldsymbol{Y}\in\mathbb{R}^{l\times m}, let \boldsymbol{M}=\boldsymbol{X}\boldsymbol{Y}. Our task is to select r columns from \boldsymbol{X} and the corresponding r rows from \boldsymbol{Y} to construct \tilde{\boldsymbol{M}}, i.e., \begin{equation} \mathop{\text{argmin}}_S\Vert \underbrace{\boldsymbol{X}_{[:,S]}}_{\boldsymbol{C}}\underbrace{\boldsymbol{Y}_{[S,:]}}_{\boldsymbol{R}} - \boldsymbol{X}\boldsymbol{Y}\Vert_F^2 \quad \text{s.t.} \quad S\subset \{0,1,\cdots,l-1\}, |S|=r \end{equation} Here S can be understood as a slice, following Python’s slicing rules. We call \boldsymbol{X}_{[:,S]}\boldsymbol{Y}_{[S,:]} the "CR approximation" of \boldsymbol{X}\boldsymbol{Y}. Note that this slicing result can also be described equivalently using a selection matrix. Assuming the 1, 2, \cdots, r-th columns of \boldsymbol{X}_{[:,S]} are the s_1, s_2, \cdots, s_r-th columns of \boldsymbol{X} respectively, we can define a selection matrix \boldsymbol{S}\in\{0,1\}^{l\times r}: \begin{equation} S_{i,j}=\left\{\begin{aligned}&1, &i = s_j \\ &0, &i\neq s_j\end{aligned}\right. \end{equation} That is, the s_j-th element of the j-th column of \boldsymbol{S} is 1, and the rest are 0. Thus, we have \boldsymbol{X}_{[:,S]}=\boldsymbol{X}\boldsymbol{S} and \boldsymbol{Y}_{[S,:]}=\boldsymbol{S}^{\top} \boldsymbol{Y}.
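The equivalence between slicing and multiplication by a selection matrix is easy to verify numerically. The following illustrative NumPy sketch (the shapes and the index set S are arbitrary choices made here) also forms the CR approximation \boldsymbol{C}\boldsymbol{R}:

```python
import numpy as np

rng = np.random.default_rng(1)
n, l, m, r = 5, 8, 6, 3
X, Y = rng.normal(size=(n, l)), rng.normal(size=(l, m))

S_idx = [1, 4, 6]                                      # the index set S (0-based, as in Python slicing)
S = np.zeros((l, r))
S[S_idx, np.arange(r)] = 1.0                           # selection matrix: S[s_j, j] = 1

C, R = X[:, S_idx], Y[S_idx, :]                        # C = X_{[:,S]},  R = Y_{[S,:]}
print(np.allclose(C, X @ S), np.allclose(R, S.T @ Y))  # True True

print(np.linalg.norm(C @ R - X @ Y))                   # Frobenius error of the CR approximation
```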

Preliminary Approximation

If we represent \boldsymbol{X} and \boldsymbol{Y} respectively as \begin{equation} \boldsymbol{X} = (\boldsymbol{x}_1,\boldsymbol{x}_2,\cdots,\boldsymbol{x}_l),\quad \boldsymbol{Y}=\begin{pmatrix}\boldsymbol{y}_1^{\top} \\ \boldsymbol{y}_2^{\top} \\ \vdots \\ \boldsymbol{y}_l^{\top}\end{pmatrix} \end{equation} where \boldsymbol{x}_i\in\mathbb{R}^{n\times 1} and \boldsymbol{y}_i\in\mathbb{R}^{m\times 1} are column vectors, then \boldsymbol{X}\boldsymbol{Y} can be written as \begin{equation} \boldsymbol{X}\boldsymbol{Y} = \sum_{i=1}^l \boldsymbol{x}_i\boldsymbol{y}_i^{\top} \end{equation} Finding the optimal CR approximation of \boldsymbol{X}\boldsymbol{Y} can be equivalently written as \begin{equation} \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_l\in\{0,1\}}\left\Vert\sum_{i=1}^l \lambda_i \boldsymbol{x}_i\boldsymbol{y}_i^{\top} - \sum_{i=1}^l\boldsymbol{x}_i\boldsymbol{y}_i^{\top}\right\Vert_F^2 \quad \text{s.t.} \quad \sum_{i=1}^l \lambda_i = r \label{eq:xy-l-k} \end{equation} The Frobenius norm of a matrix is simply the Euclidean norm of the matrix flattened into a vector. Therefore, this optimization problem is essentially equivalent to: given l vectors \boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_l\in\mathbb{R}^d, find \begin{equation} \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_l\in\{0,1\}}\left\Vert\sum_{i=1}^l \lambda_i \boldsymbol{v}_i - \sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2 \quad \text{s.t.} \quad \sum_{i=1}^l \lambda_i = r \label{eq:v-l-k} \end{equation} where \boldsymbol{v}_i = \text{vec}(\boldsymbol{x}_i \boldsymbol{y}_i^{\top}) and d=nm. Letting \gamma_i = 1 - \lambda_i, this can be further simplified to \begin{equation} \mathop{\text{argmin}}_{\gamma_1,\gamma_2,\cdots,\gamma_l\in\{0,1\}}\left\Vert\sum_{i=1}^l \gamma_i \boldsymbol{v}_i\right\Vert^2 \quad \text{s.t.} \quad \sum_{i=1}^l \gamma_i = l-r \label{eq:v-l-k-0} \end{equation} If the author understands correctly, solving this optimization problem exactly is NP-hard, so in general one can only resort to approximate algorithms. A simple case that can be solved exactly is when \boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_l are pairwise orthogonal. In this case: \begin{equation} \left\Vert\sum_{i=1}^l \gamma_i \boldsymbol{v}_i\right\Vert^2 = \sum_{i=1}^l \gamma_i^2 \Vert\boldsymbol{v}_i\Vert^2 \end{equation} Thus, its minimum value is the sum of the l-r smallest \Vert\boldsymbol{v}_i\Vert^2, which means setting \gamma_i=1 for the l-r vectors with the smallest magnitudes and \gamma_i=0 for the rest. When the pairwise orthogonality condition does not strictly hold, we can still use the selection of the l-r vectors with the smallest magnitudes as an approximate solution. Returning to the original CR approximation problem, we have \Vert\boldsymbol{x}_i\boldsymbol{y}_i^{\top}\Vert_F = \Vert\boldsymbol{x}_i\Vert \Vert \boldsymbol{y}_i\Vert. Therefore, a baseline for the optimal CR approximation of \boldsymbol{X}\boldsymbol{Y} is to retain the r columns of \boldsymbol{X} and corresponding rows of \boldsymbol{Y} for which the product of magnitudes \Vert\boldsymbol{x}_i\Vert \Vert \boldsymbol{y}_i\Vert is largest.
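A minimal sketch of this baseline (illustrative code with hypothetical shapes; the function name cr_by_norms is ours): keep the r indices with the largest \Vert\boldsymbol{x}_i\Vert\Vert\boldsymbol{y}_i\Vert and slice out the corresponding columns and rows.

```python
import numpy as np

def cr_by_norms(X, Y, r):
    """Baseline CR approximation: keep the r column/row pairs with the largest ||x_i|| * ||y_i||."""
    scores = np.linalg.norm(X, axis=0) * np.linalg.norm(Y, axis=1)
    S_idx = np.argsort(-scores)[:r]          # indices of the r largest products
    return X[:, S_idx], Y[S_idx, :], S_idx

rng = np.random.default_rng(2)
X, Y = rng.normal(size=(50, 200)), rng.normal(size=(200, 40))
C, R, S_idx = cr_by_norms(X, Y, r=32)
print(np.linalg.norm(C @ R - X @ Y) / np.linalg.norm(X @ Y))  # relative Frobenius error
```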

Sampling Perspective

Some scenarios allow us to relax Equation [eq:xy-l-k] to: \begin{equation} \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_l\in\mathbb{R}}\left\Vert\sum_{i=1}^l \lambda_i \boldsymbol{x}_i\boldsymbol{y}_i^{\top} - \sum_{i=1}^l\boldsymbol{x}_i\boldsymbol{y}_i^{\top}\right\Vert_F^2 \quad \text{s.t.} \quad \sum_{i=1}^l \#[\lambda_i\neq 0] = r \end{equation} where \#[\lambda_i\neq 0] outputs 1 if \lambda_i\neq 0 and 0 otherwise. This relaxed version extends the form of the CR approximation from \boldsymbol{C}\boldsymbol{R} to \boldsymbol{C}\boldsymbol{\Lambda}\boldsymbol{R}, where \boldsymbol{\Lambda}\in\mathbb{R}^{r\times r} is a diagonal matrix whose entries can be adjusted to achieve higher accuracy. Correspondingly, Equation [eq:v-l-k] becomes: \begin{equation} \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_l\in\mathbb{R}}\left\Vert\sum_{i=1}^l \lambda_i \boldsymbol{v}_i - \sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2 \quad \text{s.t.} \quad \sum_{i=1}^l \#[\lambda_i\neq 0] = r \end{equation}

After this relaxation, we can view the problem from a sampling perspective. First, we introduce an arbitrary l-element distribution \boldsymbol{p}=(p_1,p_2,\cdots,p_l), then we can write: \begin{equation} \sum_{i=1}^l\boldsymbol{v}_i = \sum_{i=1}^l p_i\times\frac{\boldsymbol{v}_i}{p_i} = \mathbb{E}_{i\sim \boldsymbol{p}} \left[\frac{\boldsymbol{v}_i}{p_i}\right] \end{equation} That is to say, the mathematical expectation of \boldsymbol{v}_i/p_i is exactly the target we want to approximate. Therefore, we can construct an approximation through independent sampling with replacement from the distribution \boldsymbol{p}: \begin{equation} \sum_{i=1}^l\boldsymbol{v}_i = \mathbb{E}_{i\sim \boldsymbol{p}} \left[\frac{\boldsymbol{v}_i}{p_i}\right] \approx \frac{1}{r}\sum_{j=1}^r \frac{\boldsymbol{v}_{s_j}}{p_{s_j}},\quad s_1,s_2,\cdots,s_r\sim \boldsymbol{p} \end{equation} This means that when i is one of s_1,s_2,\cdots,s_r, \lambda_i = (r p_i)^{-1}, otherwise \lambda_i=0. Readers might wonder why we use independent sampling with replacement instead of sampling without replacement, which seems more consistent with the approximation requirement. The reason is simply that independent sampling with replacement makes the subsequent analysis easier.
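As a quick numerical sanity check (an illustrative sketch, not part of the original post; the dimensions and distribution are arbitrary), averaging \boldsymbol{v}_{s_j}/p_{s_j} over samples drawn with replacement from an arbitrary positive distribution \boldsymbol{p} does recover \sum_i \boldsymbol{v}_i on average:

```python
import numpy as np

rng = np.random.default_rng(3)
l, d, r = 100, 10, 20
v = rng.normal(size=(l, d))
target = v.sum(axis=0)

p = rng.random(l); p /= p.sum()                # an arbitrary positive distribution over {1, ..., l}
est, trials = np.zeros(d), 10000
for _ in range(trials):                        # average many independent r-sample estimates
    s = rng.choice(l, size=r, replace=True, p=p)
    est += (v[s] / p[s, None]).mean(axis=0)
est /= trials
print(np.abs(est - target).max())              # close to 0: the estimator is unbiased
```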

So far, our theoretical results are independent of the choice of distribution \boldsymbol{p}, meaning they hold for any \boldsymbol{p}. This provides the possibility of choosing an optimal \boldsymbol{p}. How do we measure the quality of \boldsymbol{p}? Obviously, we want the error of the sampling estimate to be as small as possible. Therefore, we can use the variance of the sampling estimate: \begin{equation} \mathbb{E}_{i\sim \boldsymbol{p}} \left[\left\Vert\frac{\boldsymbol{v}_i}{p_i} - \sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2\right] = \left(\sum_{i=1}^l \frac{\Vert\boldsymbol{v}_i\Vert^2}{p_i}\right) - \left\Vert\sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2 \end{equation} to compare different distributions \boldsymbol{p}. Next, using the inequality of arithmetic and geometric means (here Z > 0 is an arbitrary constant, and we use \sum_{i=1}^l p_i = 1): \begin{equation} \sum_{i=1}^l \frac{\Vert\boldsymbol{v}_i\Vert^2}{p_i} = \left(\sum_{i=1}^l \frac{\Vert\boldsymbol{v}_i\Vert^2}{p_i} + p_i Z^2\right) - Z^2\geq \left(\sum_{i=1}^l 2\Vert\boldsymbol{v}_i\Vert Z\right) - Z^2 \end{equation} The equality holds when \Vert\boldsymbol{v}_i\Vert^2 / p_i = p_i Z^2, from which we obtain the optimal \boldsymbol{p} as: \begin{equation} p_i^* = \frac{\Vert\boldsymbol{v}_i\Vert}{Z},\quad Z = \sum\limits_{i=1}^l \Vert\boldsymbol{v}_i\Vert \end{equation} The corresponding error is: \begin{equation} \mathbb{E}_{i\sim \boldsymbol{p}} \left[\left\Vert\frac{\boldsymbol{v}_i}{p_i} - \sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2\right] = \left(\sum_{i=1}^l \Vert\boldsymbol{v}_i\Vert\right)^2 - \left\Vert\sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2 \end{equation} The optimal p_i is exactly proportional to \Vert\boldsymbol{v}_i\Vert, so the r vectors \boldsymbol{v}_i with the highest probabilities are also the r vectors with the largest magnitudes. This connects back to the approximation in the previous section. This result comes from the 2006 paper "Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication". The original intention was to accelerate matrix multiplication. It shows that by sampling the columns/rows of \boldsymbol{X}, \boldsymbol{Y} according to p_i\propto \Vert \boldsymbol{x}_i\boldsymbol{y}_i^{\top}\Vert_F = \Vert \boldsymbol{x}_i\Vert \Vert\boldsymbol{y}_i\Vert and multiplying by (r p_i)^{-1/2}, one can obtain a CR approximation of \boldsymbol{X}\boldsymbol{Y}, thereby reducing the multiplication complexity from \mathcal{O}(lmn) to \mathcal{O}(rmn).
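Putting this together, here is a sketch of the resulting randomized CR approximation (illustrative only; the shapes and the rank r are arbitrary, and this is a simplified reading of the sampling scheme described above, not the paper's reference implementation): sample r column/row pairs with p_i \propto \Vert\boldsymbol{x}_i\Vert\Vert\boldsymbol{y}_i\Vert, scale each by (r p_i)^{-1/2}, and compare \boldsymbol{C}\boldsymbol{R} with the exact product.

```python
import numpy as np

def cr_by_sampling(X, Y, r, rng):
    """CR approximation of X @ Y via importance sampling of column/row pairs."""
    p = np.linalg.norm(X, axis=0) * np.linalg.norm(Y, axis=1)
    p /= p.sum()                                   # optimal sampling distribution p_i
    s = rng.choice(len(p), size=r, replace=True, p=p)
    scale = 1.0 / np.sqrt(r * p[s])                # (r * p_i)^{-1/2}
    C = X[:, s] * scale                            # scaled sampled columns of X
    R = Y[s, :] * scale[:, None]                   # scaled sampled rows of Y
    return C, R

rng = np.random.default_rng(4)
X, Y = rng.normal(size=(50, 500)), rng.normal(size=(500, 40))
C, R = cr_by_sampling(X, Y, r=100, rng=rng)
print(np.linalg.norm(C @ R - X @ Y) / np.linalg.norm(X @ Y))  # relative Frobenius error
```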

Extended Discussion

Both sorting by magnitude and random sampling according to p_i\propto \Vert\boldsymbol{v}_i\Vert allow us to construct a CR approximation in linear time [i.e., \mathcal{O}(l)]. This is ideal for real-time computation, but since the sorting or sampling depends only on the individual \Vert\boldsymbol{v}_i\Vert, the accuracy is merely mediocre. If we can accept higher complexity, how can we improve the accuracy of the CR approximation?

We can try to change the sorting unit to k-tuples. For simplicity, assume k \leq l-r is a factor of l-r. The number of combinations of choosing k vectors from l vectors \boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_l is C_l^k. For each combination \{s_1,s_2,\cdots,s_k\}, we can calculate the magnitude of the vector sum \Vert \boldsymbol{v}_{s_1} + \boldsymbol{v}_{s_2} + \cdots + \boldsymbol{v}_{s_k}\Vert. With this data, we can greedily construct an approximate solution for Equation [eq:v-l-k-0]:

  • Initialize \Omega = \{1,2,\cdots,l\}, \Theta=\{\}

  • For t=1,2,\cdots,(l-r)/k, execute:

    • \Theta = \Theta\,\cup\,\mathop{\text{argmin}}\limits_{\{s_1,s_2,\cdots,s_k\}\subset \Omega}\Vert \boldsymbol{v}_{s_1} + \boldsymbol{v}_{s_2} + \cdots + \boldsymbol{v}_{s_k}\Vert;

    • \Omega = \Omega\,\backslash\,\Theta;

  • Return \Theta.

Simply put, at each step we pick, from the remaining vectors, the k vectors whose sum has the smallest magnitude, and repeat this (l-r)/k times to obtain l-r vectors. This is a natural generalization of sorting by individual vector magnitudes. Its complexity is \mathcal{O}(C_l^k), which quickly becomes prohibitive when k > 1 and l is relatively large; this also reflects the difficulty of solving the original problem exactly.
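Below is a brute-force sketch of this greedy procedure (illustrative only; it assumes k divides l-r and simply enumerates all k-subsets with itertools.combinations, so it is meant to show the idea and the combinatorial cost, not to be efficient):

```python
import numpy as np
from itertools import combinations

def greedy_drop(v, r, k):
    """Greedily drop l-r vectors, k at a time, each time minimizing the norm of the dropped group's sum."""
    l = len(v)
    assert (l - r) % k == 0
    omega, theta = set(range(l)), set()        # Omega: remaining indices, Theta: dropped indices
    for _ in range((l - r) // k):
        best = min(combinations(sorted(omega), k),
                   key=lambda idx: np.linalg.norm(v[list(idx)].sum(axis=0)))
        theta |= set(best)
        omega -= set(best)
    return sorted(omega), sorted(theta)        # kept indices, dropped indices

rng = np.random.default_rng(5)
v = rng.normal(size=(12, 4))                   # l = 12 small vectors of dimension 4
keep, drop = greedy_drop(v, r=6, k=2)
print(keep, np.linalg.norm(v[keep].sum(axis=0) - v.sum(axis=0)))  # approximation error
```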

Another question worth considering is: if the CR approximation is allowed to be relaxed to \boldsymbol{C}\boldsymbol{\Lambda}\boldsymbol{R}, what is the optimal solution for \boldsymbol{\Lambda}? If we do not restrict the structure of \boldsymbol{\Lambda}, the answer can be given by the pseudo-inverse: \begin{equation} \boldsymbol{\Lambda}^* = \mathop{\text{argmin}}_{\boldsymbol{\Lambda}}\Vert \boldsymbol{C}\boldsymbol{\Lambda}\boldsymbol{R} - \boldsymbol{X}\boldsymbol{Y}\Vert_F^2 = \boldsymbol{C}^{\dagger}\boldsymbol{X}\boldsymbol{Y}\boldsymbol{R}^{\dagger} \end{equation} What if \boldsymbol{\Lambda} must be a diagonal matrix? We can first restate the problem as: given \{\boldsymbol{u}_1,\boldsymbol{u}_2,\cdots,\boldsymbol{u}_r\}\subset\{\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_l\}, find \begin{equation} \mathop{\text{argmin}}_{\lambda_1,\lambda_2,\cdots,\lambda_r}\left\Vert\sum_{i=1}^r \lambda_i \boldsymbol{u}_i - \sum_{i=1}^l\boldsymbol{v}_i\right\Vert^2 \end{equation} Let \boldsymbol{U} = (\boldsymbol{u}_1,\boldsymbol{u}_2,\cdots,\boldsymbol{u}_r), \boldsymbol{V} = (\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_l), \boldsymbol{\lambda}=(\lambda_1,\lambda_2,\cdots,\lambda_r)^{\top}. Then the optimization objective can be written as: \begin{equation} \mathop{\text{argmin}}_{\boldsymbol{\lambda}}\left\Vert\boldsymbol{U}\boldsymbol{\lambda} - \boldsymbol{V}\boldsymbol{1}_{l\times 1}\right\Vert^2 \end{equation} This can also be solved via the pseudo-inverse: \begin{equation} \boldsymbol{\lambda}^* = \boldsymbol{U}^{\dagger}\boldsymbol{V}\boldsymbol{1}_{l\times 1} = (\boldsymbol{U}^{\top}\boldsymbol{U})^{-1}\boldsymbol{U}^{\top}\boldsymbol{V}\boldsymbol{1}_{l\times 1} \end{equation} The last equality assumes that \boldsymbol{U}^{\top}\boldsymbol{U} is invertible, which is usually satisfied. If not, (\boldsymbol{U}^{\top}\boldsymbol{U})^{-1} can be replaced by (\boldsymbol{U}^{\top}\boldsymbol{U})^{\dagger}.
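For the unrestricted case, the pseudo-inverse formula is a one-liner to check numerically (an illustrative sketch with hypothetical shapes and index set):

```python
import numpy as np

rng = np.random.default_rng(7)
n, l, m, r = 6, 10, 7, 4
X, Y = rng.normal(size=(n, l)), rng.normal(size=(l, m))
S_idx = [0, 3, 5, 8]
C, R = X[:, S_idx], Y[S_idx, :]

Lam = np.linalg.pinv(C) @ (X @ Y) @ np.linalg.pinv(R)   # unrestricted optimum C^+ (XY) R^+
print(np.linalg.norm(C @ Lam @ R - X @ Y) <= np.linalg.norm(C @ R - X @ Y))  # True
```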

The problem now is that directly applying the above formula involves too much computation for the original problem because \boldsymbol{v}_i = \text{vec}(\boldsymbol{x}_i \boldsymbol{y}_i^{\top}), meaning \boldsymbol{v}_i is an mn-dimensional vector. Thus, \boldsymbol{V} is of size mn\times l and \boldsymbol{U} is of size mn\times r, which is difficult to handle when m, n are large. Utilizing \boldsymbol{v}_i = \text{vec}(\boldsymbol{x}_i \boldsymbol{y}_i^{\top}) can help us further simplify the expression. Let \boldsymbol{u}_i = \text{vec}(\boldsymbol{c}_i \boldsymbol{r}_i^{\top}). Then: \begin{equation} \begin{aligned} (\boldsymbol{U}^{\top}\boldsymbol{V})_{i,j} =&\, \langle \boldsymbol{c}_i \boldsymbol{r}_i^{\top}, \boldsymbol{x}_j \boldsymbol{y}_j^{\top}\rangle_F = \text{Tr}(\boldsymbol{r}_i \boldsymbol{c}_i^{\top}\boldsymbol{x}_j \boldsymbol{y}_j^{\top}) = (\boldsymbol{c}_i^{\top}\boldsymbol{x}_j)(\boldsymbol{r}_i^{\top} \boldsymbol{y}_j) \\[5pt] =&\, [(\boldsymbol{C}^{\top}\boldsymbol{X})\otimes (\boldsymbol{R}\boldsymbol{Y}^{\top})]_{i,j} \end{aligned} \end{equation} That is, \boldsymbol{U}^{\top}\boldsymbol{V}=(\boldsymbol{C}^{\top}\boldsymbol{X})\otimes (\boldsymbol{R}\boldsymbol{Y}^{\top}) and \boldsymbol{U}^{\top}\boldsymbol{U}=(\boldsymbol{C}^{\top}\boldsymbol{C})\otimes (\boldsymbol{R}\boldsymbol{R}^{\top}), where \otimes denotes the Hadamard product. After this identity transformation, the computational cost for \boldsymbol{U}^{\top}\boldsymbol{V} and \boldsymbol{U}^{\top}\boldsymbol{U} is significantly reduced.
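The following sketch (illustrative NumPy code; the shapes, index set, and variable names are assumptions made here) computes the optimal diagonal entries from the small matrices \boldsymbol{C}^{\top}\boldsymbol{X}, \boldsymbol{R}\boldsymbol{Y}^{\top}, \boldsymbol{C}^{\top}\boldsymbol{C}, \boldsymbol{R}\boldsymbol{R}^{\top} and checks the result against the naive vec-based construction; roughly speaking, the cost of forming \boldsymbol{U}^{\top}\boldsymbol{V} drops from \mathcal{O}(mnrl) to \mathcal{O}(rl(n+m)).

```python
import numpy as np

rng = np.random.default_rng(6)
n, l, m, r = 6, 10, 7, 4
X, Y = rng.normal(size=(n, l)), rng.normal(size=(l, m))
S_idx = [0, 3, 5, 8]
C, R = X[:, S_idx], Y[S_idx, :]                # C = X_{[:,S]}, R = Y_{[S,:]}

# Efficient route: U^T U and U^T V via elementwise (Hadamard) products of small matrices.
UtU = (C.T @ C) * (R @ R.T)                    # r x r
UtV1 = ((C.T @ X) * (R @ Y.T)).sum(axis=1)     # (U^T V) 1_{l x 1}, a length-r vector
lam = np.linalg.solve(UtU, UtV1)               # optimal diagonal entries of Lambda

# Naive check: build U and V explicitly from vec(c_i r_i^T) and vec(x_i y_i^T).
V = np.stack([np.outer(X[:, i], Y[i, :]).ravel() for i in range(l)], axis=1)
U = np.stack([np.outer(C[:, i], R[i, :]).ravel() for i in range(r)], axis=1)
lam_naive = np.linalg.lstsq(U, V.sum(axis=1), rcond=None)[0]
print(np.allclose(lam, lam_naive))             # True

# With the optimal diagonal Lambda, the error is no worse than plain CR.
print(np.linalg.norm(C @ np.diag(lam) @ R - X @ Y), np.linalg.norm(C @ R - X @ Y))
```

Since \boldsymbol{\Lambda}=\boldsymbol{I} is itself a diagonal matrix, the optimal diagonal \boldsymbol{\Lambda} can never do worse than the plain CR approximation.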

Summary

This article introduced the CR approximation for matrix multiplication, which is a low-rank approximation with a specific column-row structure. Compared to the optimal low-rank approximation provided by SVD, the CR approximation has a more intuitive physical meaning and better interpretability.