English (unofficial) translations of posts at kexue.fm

Effective Rank of a Matrix

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

Rank is a fundamental concept in linear algebra, representing the intrinsic dimensionality of a matrix. However, the strict mathematical definition of rank is often ill-suited to numerical computation. This is because the rank equals the number of non-zero singular values, and the mathematical notion of "equal to zero" differs from the numerical one. In mathematics, "equal to zero" means absolutely and strictly zero; even 10^{-100} is not zero. Numerical computation is different: in many cases, 10^{-10} can be treated as zero.

Therefore, we hope to generalize the concept of rank into a form that better aligns with the characteristics of numerical computation. This is the origin of the concept of Effective Rank.

Error Truncation

It should be noted that there is currently no unified definition of effective rank in academia. In the following, we introduce several approaches to defining effective rank from different perspectives. For practical problems, readers can choose the definition that best suits their needs.

We primarily consider rank from the perspective of singular values. For a matrix \boldsymbol{M}\in\mathbb{R}^{n\times m}, without loss of generality, let n\leq m. Its standard rank is defined as: \begin{equation} \mathop{\text{rank}}(\boldsymbol{M}) \triangleq \max\{i\,|\,\sigma_i > 0\}\leq n \end{equation} where \sigma_1\geq \sigma_2\geq \cdots\geq\sigma_n \geq 0 are the singular values of \boldsymbol{M}. Intuitively, the concept of effective rank is intended to treat singular values close to zero as zero. Thus, a basic property of effective rank is that it does not exceed the standard rank and can degenerate into the standard rank in special cases. A simple definition satisfying this property is: \begin{equation} \mathop{\text{erank}}(\boldsymbol{M},\epsilon) \triangleq \max\{i\,|\,\sigma_i > \epsilon\} \end{equation} However, the concepts of "large" and "small" should be relative; 1 is large compared to 0.01, but small compared to 100. Therefore, it seems more scientific to normalize by dividing by \sigma_1: \begin{equation} \mathop{\text{erank}}(\boldsymbol{M},\epsilon) \triangleq \max\big\{i\,\big|\,\sigma_i/\sigma_1 > \epsilon\big\} \end{equation}
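As a concrete illustration, here is a minimal NumPy sketch of these two threshold-based definitions; the function names and the default threshold are my own choices, not from the original post:

```python
import numpy as np

def erank_abs(M, eps=1e-10):
    """Number of singular values above an absolute threshold eps."""
    s = np.linalg.svd(M, compute_uv=False)  # singular values, descending
    return int(np.sum(s > eps))

def erank_rel(M, eps=1e-10):
    """Number of singular values above eps relative to sigma_1."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > eps * s[0]))  # assumes M is not the zero matrix

# A matrix that is mathematically rank 3, plus tiny numerical noise:
A = np.random.randn(100, 3) @ np.random.randn(3, 100)
A = A + 1e-13 * np.random.randn(100, 100)
print(erank_abs(A), erank_rel(A))  # both report 3 despite 97 tiny singular values
```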

In addition to directly truncating singular values close to zero, we can also consider the problem from the perspective of low-rank approximation. In "The Road to Low-Rank Approximation (II): SVD", we proved that the optimal rank-r approximation of a matrix is the SVD result retaining only the largest r singular values. Conversely, we can specify a relative error \epsilon and define the effective rank as the minimum rank required to achieve this relative error: \begin{equation} \mathop{\text{erank}}(\boldsymbol{M},\epsilon) \triangleq \min\left\{ r \,\,\left|\,\, \sqrt{\frac{\sum_{i=1}^r \sigma_i^2}{\sum_{i=1}^n \sigma_i^2}} \geq 1-\epsilon \right.\right\} \end{equation} The numerical meaning of this definition is clearer, but since it only considers the total error, some examples turn out to be less than elegant. For instance, for an n\times n identity matrix, we expect its effective rank to always be n, because every singular value is an identical 1 and nothing should be truncated. Under the above definition, however, once n is large enough (1/n < \epsilon) the effective rank drops below n: taking r = n-1 gives \sqrt{(n-1)/n} = \sqrt{1-1/n} \geq 1-\epsilon whenever \epsilon > 1/n.
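A sketch of this definition under the Frobenius-norm error, together with the identity-matrix caveat just mentioned (the function name is mine):

```python
import numpy as np

def erank_lowrank(M, eps):
    """Smallest r whose rank-r SVD truncation attains relative error eps."""
    s = np.linalg.svd(M, compute_uv=False)
    ratio = np.sqrt(np.cumsum(s**2) / np.sum(s**2))  # increasing in r
    return int(np.searchsorted(ratio, 1 - eps)) + 1

print(erank_lowrank(np.eye(10), eps=0.01))    # 10, as expected
print(erank_lowrank(np.eye(1000), eps=0.01))  # 981: less than n once 1/n < eps
```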

Ratio of Norms

While the effective rank definitions in the previous section are intuitive, they all depend on a hyperparameter \epsilon, which is ultimately not concise enough. Our basic understanding now is that the concept of effective rank should only depend on relative magnitudes, so the problem is equivalent to constructing an effective rank from 1\geq \sigma_2/\sigma_1\geq\cdots\geq\sigma_n/\sigma_1\geq 0. Given that they all lie within [0,1], a clever idea is to sum them directly: \begin{equation} \mathop{\text{erank}}(\boldsymbol{M}) \triangleq \sum_{i=1}^n\frac{\sigma_i}{\sigma_1} \end{equation} From "The Road to Low-Rank Approximation (II): SVD", we know that the largest singular value \sigma_1 is the Spectral Norm of the matrix, denoted as \Vert\boldsymbol{M}\Vert_2, while the sum of all singular values is actually a matrix norm called the "Nuclear Norm", usually denoted as \Vert\boldsymbol{M}\Vert_*. Thus, the above formula can be abbreviated as: \begin{equation} \mathop{\text{erank}}(\boldsymbol{M}) \triangleq \frac{\Vert\boldsymbol{M}\Vert_*}{\Vert\boldsymbol{M}\Vert_2} \label{eq:n-2} \end{equation} The earliest source for this might be "An Introduction to Matrix Concentration Inequalities", where it is called the "Intrinsic Dimension," but related properties were already explored in "Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization".
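In code, Equation [eq:n-2] is a one-liner once the singular values are available (a minimal sketch; the function name is mine):

```python
import numpy as np

def erank_nuc_spec(M):
    """Equation [eq:n-2]: nuclear norm over spectral norm."""
    s = np.linalg.svd(M, compute_uv=False)
    return s.sum() / s[0]

print(erank_nuc_spec(np.eye(7)))                # 7.0: all singular values equal
print(erank_nuc_spec(np.diag([1., 1., 1e-8])))  # ~2.0: the tiny one barely counts
```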

Similarly, we can also construct the effective rank through the sum of squares: \begin{equation} \mathop{\text{erank}}(\boldsymbol{M}) \triangleq \sum_{i=1}^n\frac{\sigma_i^2}{\sigma_1^2} = \frac{\Vert\boldsymbol{M}\Vert_F^2}{\Vert\boldsymbol{M}\Vert_2^2} \label{eq:f-2} \end{equation} Here \Vert\cdot\Vert_F is the Frobenius norm. This definition likely originates from "Sampling from large matrices: an approach through geometric functional analysis", where it was called "Numerical Rank"; today it is more commonly known as "Stable Rank" and is one of the popular concepts of effective rank.
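A sketch of the stable rank; the two norms are available directly from np.linalg.norm (function name mine):

```python
import numpy as np

def stable_rank(M):
    """Equation [eq:f-2]: squared Frobenius norm over squared spectral norm."""
    fro2 = np.linalg.norm(M, 'fro')**2  # sum of squared entries
    spec2 = np.linalg.norm(M, 2)**2     # largest singular value, squared
    return fro2 / spec2

print(stable_rank(np.eye(7)))  # 7.0
```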

In terms of computational cost, Equation [eq:f-2] is cheaper than Equation [eq:n-2]. Calculating the nuclear norm requires all singular values, which implies a full SVD; in contrast, \Vert\boldsymbol{M}\Vert_F^2 is simply the sum of the squares of all entries of the matrix, so the main cost of Equation [eq:f-2] lies in the largest singular value, which is cheaper to obtain than the full set of singular values. If desired, we can generalize Equations [eq:n-2] and [eq:f-2] to sums of k-th powers, which yields the ratio of a more general Schatten norm to the spectral norm.
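For large matrices, \sigma_1 itself can be obtained iteratively rather than via a full SVD; here is a bare-bones power-iteration sketch (my own illustration, not from the original post):

```python
import numpy as np

def sigma_max(M, iters=200, seed=0):
    """Approximate the largest singular value by power iteration on M^T M."""
    v = np.random.default_rng(seed).standard_normal(M.shape[1])
    for _ in range(iters):
        v = M.T @ (M @ v)           # one step of power iteration
        v /= np.linalg.norm(v)      # renormalize to avoid overflow
    return np.linalg.norm(M @ v)    # ~ sigma_1

M = np.random.randn(500, 300)
print(sigma_max(M), np.linalg.norm(M, 2))  # the two should agree closely
```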

Distribution and Entropy

If readers search for "Effective Rank," they will likely find the paper "The Effective Rank: a Measure of Effective Dimensionality". This is also one of the earlier works exploring effective rank, which proposes defining it based on entropy.

First, since singular values are always non-negative, we can normalize them to obtain a probability distribution: \begin{equation} p_i = \frac{\sigma_i^{\gamma}}{\sum_{j=1}^n \sigma_j^{\gamma}} \end{equation} where \gamma > 0. In the literature, both \gamma=1 and \gamma=2 appear frequently (we used \gamma=2 in Moonlight). Below, we take \gamma=1 as an example. With a probability distribution, we can calculate the information entropy (Shannon entropy): \begin{equation} H = -\sum_{i=1}^n p_i \log p_i \end{equation} Recall that entropy ranges over [0, \log n], so e^H \in [1, n]. When the distribution is one-hot, e^H=1 (only one non-zero singular value); when the distribution is uniform, e^H=n (all singular values are equal). These correspond exactly to two special cases of the standard rank, which inspires us to define the effective rank as: \begin{equation} \mathop{\text{erank}}(\boldsymbol{M}) \triangleq e^H = \exp\left(-\sum_{i=1}^n p_i \log p_i\right) \label{eq:h-erank} \end{equation} Substituting the definition of p_i, we find it can be rewritten as: \begin{equation} \mathop{\text{erank}}(\boldsymbol{M}) = \exp\left(\log\sum_{i=1}^n \sigma_i -\frac{\sum_{i=1}^n \sigma_i\log\sigma_i}{\sum_{i=1}^n \sigma_i}\right) \end{equation} The first term inside the parentheses, after exponentiation, becomes \sum_{i=1}^n \sigma_i = \Vert\boldsymbol{M}\Vert_*; the second term is the weighted average of \log\sigma_i with weights \sigma_i. Since the weights are the singular values themselves, this average is dominated by the largest singular values and is roughly \log\sigma_1, which after exponentiation becomes \sigma_1=\Vert\boldsymbol{M}\Vert_2. Therefore, the overall result will be similar to Equation [eq:n-2]. This indicates that defining effective rank via entropy, while seemingly a completely different path, is actually remarkably close to the ratio of norms discussed in the previous section.
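A sketch of the entropy-based definition, with \gamma exposed as in the text (function name mine):

```python
import numpy as np

def erank_entropy(M, gamma=1.0):
    """Equation [eq:h-erank]: exp of the entropy of the singular-value distribution."""
    s = np.linalg.svd(M, compute_uv=False)
    s = s[s > 0]                     # by convention 0*log(0) = 0
    p = s**gamma / np.sum(s**gamma)  # normalize to a probability distribution
    H = -np.sum(p * np.log(p))       # Shannon entropy
    return np.exp(H)

print(erank_entropy(np.eye(7)))              # 7.0: uniform distribution
print(erank_entropy(np.diag([1., 0., 0.])))  # 1.0: one-hot distribution
```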

We know that the standard rank satisfies the triangle inequality \mathop{\text{rank}}(\boldsymbol{A}+\boldsymbol{B})\leq \mathop{\text{rank}}(\boldsymbol{A}) + \mathop{\text{rank}}(\boldsymbol{B}). The original paper proved that for symmetric positive semi-definite matrices \boldsymbol{A} and \boldsymbol{B}, the effective rank defined by Equation [eq:h-erank] satisfies \mathop{\text{erank}}(\boldsymbol{A}+\boldsymbol{B})\leq \mathop{\text{erank}}(\boldsymbol{A}) + \mathop{\text{erank}}(\boldsymbol{B}). Whether this inequality generalizes to arbitrary matrices remains unclear; in general, proving that an effective rank preserves the inequalities satisfied by the standard rank is not easy.
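While a general proof seems hard, the inequality is easy to probe numerically; the following sanity check (not a proof) exercises it on random symmetric PSD matrices, with the entropy-based definition repeated so the snippet is self-contained:

```python
import numpy as np

def erank_entropy(M, gamma=1.0):
    s = np.linalg.svd(M, compute_uv=False)
    s = s[s > 0]
    p = s**gamma / np.sum(s**gamma)
    return np.exp(-np.sum(p * np.log(p)))

rng = np.random.default_rng(0)
for _ in range(1000):
    X, Y = rng.standard_normal((2, 20, 20))
    A, B = X @ X.T, Y @ Y.T  # symmetric positive semi-definite
    # small slack for floating-point noise
    assert erank_entropy(A + B) <= erank_entropy(A) + erank_entropy(B) + 1e-9
```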

Sparsity Metrics

From the series of effective-rank definitions above, especially the progression from singular values to distributions to entropy, some readers may have already sensed that effective rank has clear commonalities with sparsity. In fact, effective rank can be understood as a sparsity measure applied to the vector of singular values. Unlike general sparsity metrics, we align its range to 1\leq \mathop{\text{erank}}(\boldsymbol{M}) \leq \mathop{\text{rank}}(\boldsymbol{M}) \leq n to match the concept of rank, which gives a more intuitive sense of the degree of sparsity.

Regarding sparsity measures, we had a systematic discussion in "How to Measure the Sparsity of Data?". Theoretically, the results there can be used to construct effective rank. In fact, we have done just that. The "Ratio of Norms" section, where we constructed effective rank based on the ratio of the Schatten norm to the spectral norm, corresponds only to formula (1) in that article. We can also use other formulas like (16), which is equivalent to defining effective rank as the square of the ratio of the nuclear norm to the Frobenius norm: \begin{equation} \mathop{\text{erank}}(\boldsymbol{M}) \triangleq \frac{\Vert\boldsymbol{M}\Vert_*^2}{\Vert\boldsymbol{M}\Vert_F^2} = \frac{(\sum_{i=1}^n\sigma_i)^2}{\sum_{i=1}^n\sigma_i^2} \end{equation} This also satisfies 1\leq \mathop{\text{erank}}(\boldsymbol{M}) \leq \mathop{\text{rank}}(\boldsymbol{M}) and is a usable definition of effective rank.
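And a sketch of this last variant (function name mine):

```python
import numpy as np

def erank_nuc_fro(M):
    """Squared nuclear norm over squared Frobenius norm."""
    s = np.linalg.svd(M, compute_uv=False)
    return s.sum()**2 / np.sum(s**2)

print(erank_nuc_fro(np.eye(7)))              # 7.0
print(erank_nuc_fro(np.diag([1., 1., 0.])))  # 2.0
```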

It is quite remarkable that our understanding of effective rank and sparsity has come full circle. I must say this is a wonderful experience: when I first learned about sparsity measures, I knew nothing about effective rank; over the past few days, while learning about effective rank, I gradually realized that it is essentially connected to sparsity. It is as if some mysterious force quietly links the knowledge we accumulate across different fields and disciplines, eventually converging in the same correct direction.

Summary

This article explored the concept of the effective rank of a matrix, which is an extension of the rank concept from linear algebra into the realm of numerical computation, capable of more effectively measuring the intrinsic dimensionality of a matrix.

When reposting, please include the original address: https://kexue.fm/archives/10847

For more detailed reposting matters, please refer to: "Scientific Space FAQ"