English (unofficial) translations of posts at kexue.fm

How to Measure the Sparsity of Data?

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

In machine learning, we often talk about sparsity; for example, we frequently say that attention matrices are usually very sparse. However, have you noticed that we never seem to provide a standard method for measuring the degree of sparsity? That is to say, our previous discussions on sparsity have remained at an intuitive level without quantitative analysis. So the question arises: is there a standard method for measuring sparsity?

After some searching, I found that there are indeed some available indicators, such as l_1/l_2 and entropy. However, because different indicators emphasize different aspects, there is no single standard answer for measuring sparsity. This article briefly records my findings.

Basic Results

In a narrow sense, "sparsity" refers to the presence of a large number of zeros in the data. Therefore, the simplest sparsity indicator is to count the proportion of zeros. But if it were just that, attention matrices could not be considered sparse because the result of a softmax is always positive. Thus, it is necessary to generalize the concept of sparsity. A naive idea is to count the proportion of elements whose absolute value does not exceed \epsilon, but how do we determine this \epsilon?

1 is large compared to 0.0001, but small compared to 10,000, so the concepts of large and small are not absolute. Intuitively, a sparse vector contains many numbers close to zero, so the average of its absolute values will certainly be relatively small. Since large and small are relative concepts, we might as well divide this average by the maximum value to obtain a relative result. Therefore, a seemingly reasonable indicator is: \begin{equation} S_0(\boldsymbol{x}) = \frac{(|x_1| + |x_2| + \cdots + |x_n|)/n}{\max(|x_1|, |x_2|, \cdots, |x_n|)} \label{eq:start} \end{equation} where \boldsymbol{x} = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^n is the vector whose sparsity needs to be evaluated (hereafter, we assume \boldsymbol{x} is not a constant vector, i.e., it has at least two distinct elements). Vectors with smaller indicators are sparser. However, although this indicator has some rationality, it is not "smooth" enough. The main issue is that the \max operation is extremely susceptible to outliers and does not reflect the statistical characteristics of the data well.
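To make this concrete, here is a minimal numpy sketch of this baseline indicator (the helper name s0 is my own):

```python
import numpy as np

def s0(x):
    """Mean absolute value divided by the max absolute value; smaller = sparser."""
    a = np.abs(np.asarray(x, dtype=float))
    return a.mean() / a.max()

dense = np.array([1.0, 1.1, 0.9, 1.0])     # nearly uniform vector
sparse = np.array([1.0, 0.01, 0.02, 0.0])  # mostly near zero
```

Note that a single large outlier drags s0 toward zero even when the rest of the data is dense, which is exactly the non-smoothness complained about above.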

Smooth Homogeneity

Following the logic in "Seeking a Smooth Maximum Function", we replace \max with its smooth approximation. The standard smooth approximation of \max is \text{logsumexp}: \begin{equation} \max(|x_1|, |x_2|, \cdots, |x_n|) \approx \frac{1}{k} \log \sum_{i=1}^n e^{k|x_i|} \end{equation} However, using \text{logsumexp} here does not provide a good improvement. First, due to the amplification effect of e^{k|x_i|}, \text{logsumexp} is also easily affected by outliers. Second, \text{logsumexp} lacks positive homogeneity, which is undesirable (if all x_i are multiplied by a positive number \alpha, the sparsity should not change). In the aforementioned article, we also provided a smooth approximation of \max that possesses positive homogeneity, which happens to be the l_p norm (p > 1): \begin{equation} \max(|x_1|, |x_2|, \cdots, |x_n|) \approx \left(\sum_{i=1}^n |x_i|^p\right)^{1/p} \triangleq l_p(\boldsymbol{x}) \end{equation} Observing Equation [eq:start] again, we find that its numerator is exactly l_1(\boldsymbol{x})/n. Combining these results, we obtain an indicator for measuring sparsity: \begin{equation} S_{1,p}(\boldsymbol{x}) = \frac{l_1(\boldsymbol{x})/n}{l_p(\boldsymbol{x})} \label{eq:s1p} \end{equation} If we are only comparing vectors of a fixed dimension, the /n can be omitted. A common choice is p=2, which yields the l_1/l_2 sparsity indicator; if p \to \infty, it essentially becomes the indicator in [eq:start].
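A numpy sketch of this indicator (the helper name s1p is mine), illustrating that a one-hot vector scores lower than a uniform one and that the measure is scale-invariant:

```python
import numpy as np

def s1p(x, p=2):
    """S_{1,p}(x) = (l1(x)/n) / lp(x); smaller values indicate sparser vectors."""
    a = np.abs(np.asarray(x, dtype=float))
    l1 = a.sum()
    lp = (a ** p).sum() ** (1.0 / p)
    return (l1 / a.size) / lp

one_hot = np.array([1.0, 0.0, 0.0, 0.0])  # the sparsest direction
uniform = np.ones(4)                       # the densest direction
```

With p=2 and n=4 this gives 0.25 for the one-hot vector and 0.5 for the uniform one.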

Ideal Properties

When introducing the concept of "entropy" in "Unbearable Entropy: From Entropy and the Maximum Entropy Principle to the Maximum Entropy Model (I)", we found that the mathematical form of entropy can be determined through several ideal properties it should possess. Can sparsity follow this example?

The paper "Comparing Measures of Sparsity" attempted exactly this. It proposed that a sparsity measure S(\boldsymbol{x}) should satisfy the following ideal properties (without loss of generality, assume \boldsymbol{x} is a non-negative vector; if not, take the absolute value of each term):

D1: S([\cdots, x_i - \alpha, \cdots, x_j + \alpha, \cdots]) > S(\boldsymbol{x}), where x_i > x_j and 0 < \alpha < \frac{x_i - x_j}{2}. This property states that if the sum remains constant, the more uniform the vector, the less sparse it is.

D2: S(\alpha\boldsymbol{x}) = S(\boldsymbol{x}), where \alpha > 0. This is easy to understand: sparsity is a relative property. Multiplying all elements by the same factor does not change relative sizes and thus should not change sparsity.

D3: S(\alpha + \boldsymbol{x}) > S(\boldsymbol{x}), where \alpha > 0. This is also intuitive: adding a positive number to all elements moves them further from zero, so sparsity naturally decreases.

D4: S(\boldsymbol{x}) = S(\boldsymbol{x} \circ \boldsymbol{x}) = S(\boldsymbol{x} \circ \boldsymbol{x} \circ \boldsymbol{x}) = \cdots, where \circ denotes vector concatenation. This means that simple data replication does not change sparsity.

P1: For any given i \in \{1, 2, \cdots, n\}, there exists \beta_i > 0 such that for any \alpha > 0, S([\cdots, x_{i-1}, x_i + \beta_i + \alpha, x_{i+1}, \cdots]) < S([\cdots, x_{i-1}, x_i + \beta_i, x_{i+1}, \cdots]). This property expresses that when one element is sufficiently large, it dominates the sparsity of the entire vector.

P2: S(\boldsymbol{x} \circ [0]) < S(\boldsymbol{x}). This is very basic: adding a zero to a vector should increase its sparsity.

The original paper examined various common sparsity indicators and found that only one, the "Gini Index," satisfies all six properties (this Gini Index is rather complex and will not be introduced here). However, I must remind readers to pay close attention to the derivation process when reading the original paper: I found that its proof that l_1/l_2 does not satisfy D3 is incorrect; in fact, l_1/l_2 does satisfy D3. As for the other derivations, I did not examine them in detail; please verify them yourself.

Reference Proofs

For the two indicators in this article, indicator [eq:start] satisfies all the properties except D1 (if the > in D1 is relaxed to \geq, D1 is satisfied as well). Verifying these properties is relatively straightforward and will not be detailed here; readers may complete the checks themselves.

As for indicator [eq:s1p], it can be proven to satisfy all properties except D4. The proofs for D3 and P1 are slightly more complex; reference proofs are provided below.

To prove D3, we only need to show that \begin{equation} \frac{\left(\sum_{i=1}^n (x_i + \alpha)\right)^p}{\sum_{i=1}^n (x_i + \alpha)^p} \end{equation} is monotonically increasing with respect to \alpha > 0. Taking the logarithm of both sides: \begin{equation} p \log \sum_{i=1}^n (x_i + \alpha) - \log \sum_{i=1}^n (x_i + \alpha)^p \triangleq f(\alpha) \end{equation} We only need to prove f'(\alpha) > 0. Differentiating directly: \begin{equation} f'(\alpha) = \frac{pn}{\sum_{i=1}^n (x_i + \alpha)} - \frac{p \sum_{i=1}^n (x_i + \alpha)^{p-1}}{\sum_{i=1}^n (x_i + \alpha)^p} \end{equation} f'(\alpha) > 0 is equivalent to \begin{equation} \frac{1}{n} \sum_{i=1}^n (x_i + \alpha)^p > \left(\frac{1}{n} \sum_{i=1}^n (x_i + \alpha)\right) \left(\frac{1}{n} \sum_{i=1}^n (x_i + \alpha)^{p-1}\right) \end{equation} This follows directly from the Power Mean Inequality.
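The monotonicity proved here is easy to confirm numerically; a quick sketch (the indicator is redefined locally, and the particular x and p are arbitrary choices of mine):

```python
import numpy as np

def s1p(x, p):
    a = np.abs(np.asarray(x, dtype=float))
    return (a.sum() / a.size) / (a ** p).sum() ** (1.0 / p)

x = np.array([3.0, 0.1, 0.5, 2.0])
alphas = np.linspace(0.0, 10.0, 51)
# D3: S_{1,p}(x + alpha) should be strictly increasing in alpha > 0
vals = [s1p(x + a, p=3) for a in alphas]
```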

The proof strategy for P1 is similar. We only need to prove that when x_i is sufficiently large, \begin{equation} \frac{\left(x_i + \alpha + \sum_{j \neq i} x_j\right)^p}{(x_i + \alpha)^p + \sum_{j \neq i} x_j^p} \end{equation} is monotonically decreasing with respect to \alpha > 0. Taking the logarithm: \begin{equation} p \log \left(x_i + \alpha + \sum_{j \neq i} x_j\right) - \log \left((x_i + \alpha)^p + \sum_{j \neq i} x_j^p\right) \triangleq g(\alpha) \end{equation} We only need to prove g'(\alpha) < 0. Differentiating directly: \begin{equation} g'(\alpha) = \frac{p}{x_i + \alpha + \sum_{j \neq i} x_j} - \frac{p (x_i + \alpha)^{p-1}}{(x_i + \alpha)^p + \sum_{j \neq i} x_j^p} \end{equation} g'(\alpha) < 0 is equivalent to \begin{equation} p \sum_{j \neq i} x_j^p < p(x_i + \alpha)^{p-1} \sum_{j \neq i} x_j \end{equation} As long as the x_j are not all zero, when x_i is sufficiently large, the above inequality holds for all \alpha > 0.

Perfect Metric

Although indicator [eq:s1p] does not satisfy D4, if we modify it slightly to: \begin{equation} S_{1,p}^*(\boldsymbol{x}) = \frac{l_1(\boldsymbol{x})/n}{l_p(\boldsymbol{x})/n^{1/p}} = n^{(1-p)/p} \frac{l_1(\boldsymbol{x})}{l_p(\boldsymbol{x})} \label{eq:s1p-plus} \end{equation} it satisfies D4. It can also be verified that it satisfies P2. Since the other properties do not involve changes in dimension n, they are satisfied just as they were for indicator [eq:s1p]. In other words, S_{1,p}^*(\boldsymbol{x}) is a "perfect metric" that satisfies all six properties!
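The D4 (cloning) invariance of this corrected indicator is easy to verify numerically; a sketch (helper name s1p_star is mine):

```python
import numpy as np

def s1p_star(x, p=2):
    """S*_{1,p}(x) = n^{(1-p)/p} * l1(x) / lp(x)."""
    a = np.abs(np.asarray(x, dtype=float))
    n = a.size
    return n ** ((1.0 - p) / p) * a.sum() / (a ** p).sum() ** (1.0 / p)

x = np.array([2.0, 0.0, 0.1, 1.0])
xx = np.concatenate([x, x])  # clone the data: D4 says sparsity is unchanged
```

Cloning doubles l_1 by a factor of 2, l_p by 2^{1/p}, and n by 2; the three factors cancel exactly, which is why the n^{(1-p)/p} normalization restores D4.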

From the Power Mean Inequality, we know l_p(\boldsymbol{x})/n^{1/p} \geq l_1(\boldsymbol{x})/n, so S_{1,p}^*(\boldsymbol{x}) \leq 1. Furthermore, since \begin{equation} \frac{l_p(\boldsymbol{x})}{l_1(\boldsymbol{x})} = \left(\left(\frac{x_1}{l_1(\boldsymbol{x})}\right)^p + \cdots + \left(\frac{x_n}{l_1(\boldsymbol{x})}\right)^p\right)^{1/p} \leq \left(\frac{x_1}{l_1(\boldsymbol{x})} + \cdots + \frac{x_n}{l_1(\boldsymbol{x})}\right)^{1/p} = 1 \end{equation} it follows that S_{1,p}^*(\boldsymbol{x}) \geq n^{(1-p)/p}. Thus, S_{1,p}^*(\boldsymbol{x}) \in [n^{(1-p)/p}, 1]. Additionally, S_{1,p}^*(\boldsymbol{x}) can be generalized to: \begin{equation} S_{q,p}^*(\boldsymbol{x}) = \frac{l_q(\boldsymbol{x})/n^{1/q}}{l_p(\boldsymbol{x})/n^{1/p}} = n^{1/p-1/q} \frac{l_q(\boldsymbol{x})}{l_p(\boldsymbol{x})} \in [n^{1/p-1/q}, 1] \end{equation} As long as p > q > 0, this is also a "perfect metric" satisfying the six properties.
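A sketch checking these bounds at the two extremes, where a one-hot vector hits the lower bound and a uniform vector hits the upper bound (helper name s_star is mine):

```python
import numpy as np

def s_star(x, q=1, p=2):
    """S*_{q,p}(x) = n^{1/p - 1/q} * lq(x) / lp(x), valid for p > q > 0."""
    a = np.abs(np.asarray(x, dtype=float))
    n = a.size
    lq = (a ** q).sum() ** (1.0 / q)
    lp = (a ** p).sum() ** (1.0 / p)
    return n ** (1.0 / p - 1.0 / q) * lq / lp

n = 8
one_hot = np.eye(n)[0]       # sparsest: lower bound n^{1/p - 1/q}
uniform = np.ones(n) / n     # densest: upper bound 1
```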

When p=2, q=1, we have: \begin{equation} S_{1,2}^*(\boldsymbol{x}) = \frac{1}{\sqrt{n}} \frac{l_1(\boldsymbol{x})}{l_2(\boldsymbol{x})} \in \left[\frac{1}{\sqrt{n}}, 1\right] \end{equation} This result can answer some questions about sparsification. For example, why is L1 regularization conducive to sparsity? Because L1 appears in the numerator of the above formula; the smaller it is, the sparser the result. Why is L2 regularization not conducive to sparsity? Because L2 appears in the denominator; the smaller it is, the less sparse the result. To achieve sparsity more accurately, one should use S_{1,2}^*(\boldsymbol{x}) as a regularization term, which both minimizes L1 and maximizes L2, directly optimizing the sparsity metric.

Specifically, when \boldsymbol{x} has no zero elements, we also have the relationship: \begin{equation} S_{1,2}^*(\boldsymbol{x}) = \cos(\text{sign}(\boldsymbol{x}), \boldsymbol{x}) \end{equation} where \text{sign} is the sign function applied element-wise. This formula is very intuitive: \text{sign}(\boldsymbol{x}) can be seen as the "densest" derived vector of \boldsymbol{x}. The degree of sparsity is the similarity between \boldsymbol{x} and its densest derived vector; smaller similarity naturally represents higher sparsity.
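This identity takes only a few lines to verify (a sketch with an arbitrary zero-free vector):

```python
import numpy as np

x = np.array([0.3, -2.0, 0.05, 1.2, -0.7])  # no zero elements
s = np.sign(x)                               # the "densest" derived vector
cos_sim = np.dot(s, x) / (np.linalg.norm(s) * np.linalg.norm(x))
s_star_12 = np.abs(x).sum() / (np.sqrt(x.size) * np.linalg.norm(x))
```

The key step is that the inner product of sign(x) with x is exactly the sum of absolute values, while the norm of sign(x) is sqrt(n).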

If we treat the elements of \boldsymbol{x} as samples of a random variable x, we can also write: \begin{equation} S_{1,2}^*(\boldsymbol{x}) = \frac{\mathbb{E}[|x|]}{\sqrt{\mathbb{E}[x^2]}} = \frac{\mathbb{E}[|x|]}{\sqrt{\mathbb{E}[|x|]^2 + \text{Var}[|x|]}} = \frac{1}{\sqrt{1 + \text{Var}[|x|] / \mathbb{E}[|x|]^2}} \end{equation} Taking this further, the square of \frac{\mathbb{E}[|x|]}{\sqrt{\text{Var}[|x|]}} is exactly the "Signal-to-Noise Ratio" (SNR, the ratio of the square of the mean to the variance) of |x|. This tells us that "the lower the SNR of |x|, the sparser x is."
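Treating a long sample vector this way, the identity linking S_{1,2}^* to the SNR of |x| can be checked numerically; a sketch using standard-normal samples (seed and sample size are arbitrary choices of mine):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=10_000)
a = np.abs(x)
s_star_12 = a.mean() / np.sqrt((x ** 2).mean())  # E[|x|] / sqrt(E[x^2])
snr = a.mean() ** 2 / a.var()                    # mean^2 / variance of |x|
```

For a standard normal, E[|x|] = sqrt(2/pi) and E[x^2] = 1, so s_star_12 should come out near sqrt(2/pi), about 0.798.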

Connection to Entropy

Now let’s return to the attention matrix. Its characteristic is that each row corresponds to a probability distribution, automatically satisfying x_i \geq 0 and x_1 + \cdots + x_n = 1. The most certain probability distribution is the one-hot distribution, which is also the most sparse; the most uncertain distribution is the uniform distribution, which is clearly the least sparse. From these two extremes, we can guess that the sparsity of a probability distribution is related to its uncertainty.

We know that the uncertainty of a probability distribution is generally measured by (Shannon) entropy: \begin{equation} H(\boldsymbol{x}) = -\sum_{i=1}^n x_i \log x_i \in [0, \log n] \end{equation} In this case, indicator [eq:s1p-plus] becomes \frac{n^{(1-p)/p}}{l_p(\boldsymbol{x})}, a quantity derived from the l_p norm. Since sparsity and uncertainty appear to be related, does this mean there is a connection between entropy and the l_p norm?

Indeed. In fact, based on the l_p norm, we can construct Rényi entropy: \begin{equation} H_p(\boldsymbol{x}) = -\frac{1}{p-1} \log \sum_{i=1}^n x_i^p \end{equation} It can be proven that H_1(\boldsymbol{x}) = \lim_{p \to 1} H_p(\boldsymbol{x}) = H(\boldsymbol{x}), meaning that when p \to 1, it corresponds exactly to classical Shannon entropy. When p \neq 1, it is general Rényi entropy (in some contexts, Rényi entropy specifically refers to the case p=2). Every type of Rényi entropy can serve as a measure of uncertainty; their range is [0, \log n], and they all take the minimum value at the one-hot distribution and the maximum value at the uniform distribution. In this sense, all Rényi entropies are equivalent to some extent, which explains the connection between entropy and the l_p norm.
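A sketch of Rényi entropy for a probability vector, checking the p -> 1 limit against Shannon entropy as well as the two extremes (the function names are mine):

```python
import numpy as np

def renyi_entropy(x, p):
    """H_p(x) = log(sum_i x_i^p) / (1 - p), for a probability vector x and p != 1."""
    return np.log((x ** p).sum()) / (1.0 - p)

def shannon_entropy(x):
    x = x[x > 0]  # drop zeros to avoid log(0)
    return -(x * np.log(x)).sum()

dist = np.array([0.5, 0.25, 0.125, 0.125])
```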

It is worth mentioning that Rényi entropy with p \neq 1 is often more numerically friendly. This is because \sum_{i=1}^n x_i^p is necessarily a positive, bounded result, so the \log operation does not have to worry about \log 0, and the \log only needs to be calculated once for the final result. In contrast, standard Shannon entropy requires calculating \log for each x_i, which increases computation and requires special clipping to prevent \log 0.

Summary

This article surveyed ways of measuring sparsity and discussed their connections with concepts such as L1, L2, and entropy.

If you reprint this article, please include the original address: https://kexue.fm/archives/9595

For more detailed reprinting matters, please refer to: "Scientific Space FAQ"