Is a Higher Singular Value Entropy of Matrix Parameters Always Better? · English (unofficial) translations of posts at kexue.fm

In last year’s technical report Muon is Scalable for LLM Training, to compare the differences between models trained by Muon and Adam, we introduced the concept of “singular value entropy” and observed that the matrix parameters trained by Muon generally have higher singular value entropy than those trained by Adam. We regarded this as empirical evidence that “Muon makes fuller use of model parameters”.

However, is higher singular value entropy really always better? Is there an optimal singular value entropy? Or conversely, if we must fix the singular value entropy of matrix parameters in advance, what value should we choose?

Concept Review

First let us review some concepts. We know that singular values of a matrix are non-negative, so we can normalize them, for example by direct normalization or squared normalization: p_i = \frac{\sigma_i}{\sum_{j=1}^n \sigma_j}\qquad\text{or}\qquad p_i = \frac{\sigma_i^2}{\sum_{j=1}^n \sigma_j^2} thus obtaining an n-dimensional probability distribution \boldsymbol{p} = (p_1,p_2,\cdots,p_n). With a probability distribution, we can compute related quantities, such as entropy H(\boldsymbol{p}) = -\sum_{i=1}^n p_i \log p_i This is the concept of “singular value entropy”. We know that maximum entropy corresponds to a uniform distribution, so singular value entropy also measures the uniformity of singular values. However, entropy is not limited to one definition; the above form with log expectation is called “Shannon entropy”. There are other entropies, such as Rényi entropy H_q(\boldsymbol{p}) = -\frac{1}{q-1}\log \sum_{i=1}^n p_i^q In fact, Shannon entropy is a special case corresponding to q\to 1. For other q values, it also attains its maximum \log n at the uniform distribution and minimum 0 at a one-hot distribution. The singular value entropy of a matrix is essentially equivalent to its effective rank; another interpretation is the sparsity of the matrix’s singular values (see How to Measure the Sparsity of Data?).

Problem Transformation

The question now is: how can we convert the doubt “is higher singular value entropy really better?” into a computable proposition, so as to give a reference answer?

Intuitively, since current models are generally over-parameterized, there exist many redundant degrees of freedom, which allow us to impose certain constraints on model parameters—such as constraining their norm or singular value entropy—without significantly degrading performance. After imposing constraints, the degrees of freedom are naturally compressed. We can empirically assume: the greater the remaining degrees of freedom, the stronger the model’s capacity; this is the principle behind seeking the “optimal singular value entropy”.

So, given a fixed singular value entropy, how many degrees of freedom does the matrix retain? Let us first examine two extreme cases: the first is minimum entropy, where only one singular value is nonzero, so the matrix is a rank-1 matrix with at most 2n degrees of freedom; the second is maximum entropy, where all singular values are nonzero and equal, making the matrix (a scalar multiple of) an orthogonal matrix with about n^2/2 degrees of freedom. From this perspective, maximum entropy does seem better than minimum entropy, but is there an even better intermediate value?

We can also set aside the matrix context and look directly at the “entropy density” of the singular value distribution. The maximum entropy of an n-dimensional distribution is \log n, achieved only by the uniform distribution; the minimum entropy is 0, corresponding to a one-hot distribution, with n different choices. Clearly, the “density” at these two boundaries does not appear high; they represent only a small subset of all distributions, so we may guess that the optimal value indeed lies somewhere in the middle.

With this transformation, our problem becomes: if we uniformly sample an n-dimensional distribution and compute its entropy, what does the distribution of entropy look like? Where is the probability density highest? The point of maximum probability density corresponds to the entropy value around which the largest number of distributions cluster, and we regard this as the entropy value with the highest expressive power.

Geometric Picture

Some readers may find “uniformly sampling an n-dimensional distribution” puzzling—usually we are given a probability distribution and sample a number from it; now we need to sample the distribution itself. However, if we temporarily set aside the probabilistic meaning of \boldsymbol{p}, we can obtain a very clear geometric understanding.

Specifically, an n-dimensional probability distribution \boldsymbol{p} is actually an n-dimensional vector satisfying the following constraints: 0 \leq p_i \leq 1 \, (\forall i=1,2,\cdots,n)\qquad \sum_{i=1}^n p_i = 1 The first constraint describes the unit hypercube in n-dimensional space, and the second constraint describes a hyperplane in n-dimensional space. Their intersection yields a finite region, a sub-plane called the “(n-1)-simplex”, denoted by \Delta^{n-1}. So “uniformly sampling an n-dimensional distribution” simply means uniformly sampling a point on this sub-plane, which is much more intuitive.

Some readers might wonder why the symbol \Delta is used for the simplex. In fact, when n=3, this sub-plane is exactly the equilateral triangle with vertices (1,0,0), (0,1,0), and (0,0,1); it is easy to imagine that when n=4, it forms a regular tetrahedron in three-dimensional space. By extension, geometric intuition shows that a simplex is essentially the generalization of an equilateral triangle to higher dimensions, hence the use of the triangle symbol \Delta.

Limit Theorem

Getting back on track. Now consider computing the probability density of entropy, which can be formally written as \rho(h) = \frac{\int_{\boldsymbol{p}\in\Delta^{n-1}} \delta(H(\boldsymbol{p}) - h) d\boldsymbol{p}}{\int_{\boldsymbol{p}\in\Delta^{n-1}} d\boldsymbol{p}} where \delta(\cdot) is the Dirac delta function. The physical meaning of this expression is quite intuitive: it “counts” the distributions with H(\boldsymbol{p})=h and divides by the total number. However, this is a purely formal expression, and actual analytical computation is nearly impossible.

Considering that in practical scenarios n is usually large (hundreds to thousands), we can invoke the central limit theorem and assume that \rho(h) is approximately a normal distribution. A normal distribution is determined by only two parameters, the mean and variance (second moment), so we only need to compute the mean and variance of entropy, i.e., \mathbb{E}_{\boldsymbol{p}\sim\Delta^{n-1}}[H(\boldsymbol{p})] and \mathbb{E}_{\boldsymbol{p}\sim\Delta^{n-1}}[H(\boldsymbol{p})^2]. If we just want the point of maximum probability, we only need to compute the mean, because the maximum probability point of a normal distribution is its mean.

The difficulty in computing the mean \mathbb{E}_{\boldsymbol{p}\sim\Delta^{n-1}}[H(\boldsymbol{p})] lies in how to realize \boldsymbol{p}\sim\Delta^{n-1}; although this sampling is conceptually intuitive, its computational implementation is not straightforward. Readers familiar with the Dirichlet distribution may apply known results to speed up understanding, but here we deliberately avoid introducing the Dirichlet distribution to reduce the comprehension burden for general readers.

Sampling Transformation

Fortunately, uniform sampling on the simplex can be transformed into independent repeated sampling from the exponential distribution \text{Exp}(1): \boldsymbol{p}\sim\Delta^{n-1}\qquad\Leftrightarrow\qquad p_i = \frac{x_i}{\sum_{j=1}^n x_j},\quad x_1,x_2,\cdots,x_n\sim \text{Exp}(1) In other words, independently sampling n numbers from the exponential distribution \text{Exp}(1) and then normalizing them is equivalent to uniform sampling on the simplex. How can we understand this? A rigorous proof is certainly possible, but here the author would rather introduce an analogy based on isotropy.

Let us first consider another problem: uniformly sampling an n-dimensional unit vector. This is easy: just sample n numbers (x_1,x_2,\cdots,x_n)=\boldsymbol{x} from a standard normal distribution \mathcal{N}(0,1), then return \boldsymbol{x}/\|\boldsymbol{x}\|. Why does this work? Because the probability density of \mathcal{N}(0,1) is proportional to e^{-x^2/2}; sampling n times yields e^{-\sum_{i=1}^n x_i^2/2} = e^{-\|\boldsymbol{x}\|^2/2}, which depends only on the length of \boldsymbol{x} and not on its direction. Hence sampling from \mathcal{N}(0,1) is uniform with respect to direction.

A unit vector is a vector whose “sum of squares equals 1”; now the probability distribution we want to sample is a vector whose “sum equals 1”. Thus, the probability distribution is a unit vector under the “sum” norm. We know that the probability density of the exponential distribution \text{Exp}(1) is e^{-x}; sampling n times yields e^{-\sum_{i=1}^n x_i}, which depends only on the “sum” norm. Therefore, it is uniform with respect to this new kind of “directional vector”.

Mean Field

After all these preparations, we can now formally start the calculation. Under the new parameterization, the entropy H(\boldsymbol{p}) can be transformed into H(\boldsymbol{p}) = \log \sum_{i=1}^n x_i - \frac{\sum_{i=1}^n x_i \log x_i}{\sum_{i=1}^n x_i} Then taking expectations on both sides yields a rather complicated integral. Interestingly, although this integral is complex, its analytical solution can be obtained through certain techniques. However, here we will not delve into those details; instead, we use a mean-field trick to quickly obtain an approximate solution: \begin{aligned} \mathbb{E}\left[\log \sum_{i=1}^n x_i - \frac{\sum_{i=1}^n x_i \log x_i}{\sum_{i=1}^n x_i}\right] \approx \log \sum_{i=1}^n \mathbb{E}[x_i] - \frac{\sum_{i=1}^n \mathbb{E}[x_i \log x_i]}{\sum_{i=1}^n \mathbb{E}[x_i]} = \log n - (1 - \gamma) \end{aligned} where \gamma is Euler’s constant. Here \mathbb{E}[x_i]=1 is obvious; the slightly tricky part is \mathbb{E}[x_i \log x_i]=1-\gamma, but this is also a classic result related to the derivative of the Gamma function. If you don’t know this result, you can compute it directly with Mathematica.

We know that the maximum entropy of an n-dimensional distribution is \log n; now we have an extra term -(1-\gamma) = -0.42278\cdots, which is exactly what we wanted. It indicates that the point of maximum entropy density lies approximately 0.42 below the maximum entropy. For matrices, the expressive power is stronger when the singular value entropy is around this value, rather than as large as possible. If we need to fix the singular value entropy, we can choose this value.

General Results

Using the same idea, we can compute the expectation of the general Rényi entropy: \begin{aligned} \mathbb{E}[H_q(\boldsymbol{p})] =&\, \frac{1}{q-1}\mathbb{E}\left[q\log \sum_{i=1}^n x_i -\log \sum_{i=1}^n x_i^q\right] \\ \approx&\, \frac{1}{q-1}\left[q\log \sum_{i=1}^n \mathbb{E}[x_i] - \log \sum_{i=1}^n \mathbb{E}[x_i^q]\right] \\ =&\, \frac{1}{q-1}\left[q \log n - \log n\Gamma(q+1)\right] \\ =&\, \log n - \frac{\log \Gamma(q+1)}{q-1} \end{aligned} \label{eq:E-Hq} If q\to 1, the corresponding limit is exactly \log n - (1-\gamma), consistent with the result of the previous section. If q is an integer greater than 1, we can directly write \mathbb{E}[H_q(\boldsymbol{x})] \approx \log n - \frac{\log q\,!}{q-1} If we take q=2, the result is the extremely simple \log (n/2). Since \log (n/2) is the maximum entropy of an n/2-dimensional distribution, this result can be intuitively understood as: the singular value entropy need only “fill up” half of the dimensions, not all of them; otherwise, it would actually lose diversity.

Complete Analysis

In the previous two sections, we focused on estimating the mean, which is sufficient for our original goal. However, for completeness, this section supplements the estimation of variance and discusses why asymptotic normality holds for \rho(h).

First, we rewrite H_q(\boldsymbol{p}) identically as H_q(\boldsymbol{p}) = \log n + \frac{1}{q-1}\Bigg[q\log \underbrace{\frac{1}{n}\sum_{i=1}^n x_i}_u -\log \underbrace{\frac{1}{n}\sum_{i=1}^n x_i^q}_v\Bigg] When q\neq 1, u and v are two different statistics, and the central limit theorem tells us that as n\to\infty, (u,v) asymptotically follow a bivariate normal distribution. Based on x_i\sim \text{Exp}(1), we can find that the mean and variance of u are 1 and 1/n respectively; the mean and variance of v are \Gamma(q+1) and [\Gamma(2q+1) - \Gamma(q+1)^2]/n respectively.

Thus, when q is not large and n is large, they both concentrate around their means, allowing a Taylor expansion around the mean, such as a first-order approximation: H_q(\boldsymbol{p}) \approx \log n + \frac{1}{q-1}\Bigg[q(u-1) - \left(\log \Gamma(q+1) + \frac{v - \Gamma(q+1)}{\Gamma(q+1)}\right)\Bigg] Taking expectation on both sides yields Eq. [eq:E-Hq]. If needed, we could expand to higher orders to obtain remainder terms. Since H_q(\boldsymbol{p}) is now linear in u and v, we can also compute its variance using the following formula: \mathbb{V}\mathrm{ar}[H_q(\boldsymbol{p})] \approx \mathbb{V}\mathrm{ar}\left[\frac{q u}{q-1}\right] + \mathbb{V}\mathrm{ar}\left[\frac{v}{(q-1) \Gamma(q+1)}\right] - 2\mathbb{C}\mathrm{ov}\left[\frac{q u}{q-1},\frac{v}{(q-1) \Gamma(q+1)}\right] Note that u and v are generally not independent, so the covariance term cannot be neglected: \mathbb{C}\mathrm{ov}[u, v] = \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n \mathbb{C}\mathrm{ov}[x_i, x_j^q] = \frac{1}{n^2}\sum_{i=1}^n \mathbb{C}\mathrm{ov}[x_i, x_i^q] The first equality follows from the definitions of u, v and the bilinearity of covariance; the second equality holds because for i\neq j, x_i and x_j are independent, so their covariance is zero. Next we compute \mathbb{C}\mathrm{ov}[x_i, x_i^q] = \mathbb{E}[x_i^{q+1}] - \mathbb{E}[x_i] \mathbb{E}[x_i^q]=\Gamma(q+2) - \Gamma(q+1) = q \Gamma(q+1). Substituting these results, we obtain \mathbb{V}\mathrm{ar}[H_q(\boldsymbol{p})] \approx \frac{1}{n}\left[\frac{\Gamma(2q+1)}{(q-1)^2 \Gamma(q+1)^2} - \frac{q^2+1}{(q-1)^2}\right] When q=2, the result is 1/n; when q\to 1, the result is (\pi^2/3-3)/n.

In summary, the central limit theorem does not directly provide asymptotic normality for the entropy h; we only obtain asymptotic normality for the intermediate variables u and v, from which we estimate the mean and variance of h, showing that the variance of h is also \mathcal{O}(1/n). This indicates that the variance shrinks as n increases, so it is scientifically sound to locate the maximum probability density point via the mean.

Conclusion

This article transforms the question “Is higher singular value entropy always better?” into a mathematically quantifiable proposition, with the core assumption that: after fixing the singular value entropy, the remaining degrees of freedom of the matrix parameters represent its expressive power. We then interpret the point of maximum degrees of freedom as the point of maximum entropy density, and through a series of transformations and approximations, we obtain an optimal entropy value. Of course, this result itself is not particularly important; the significance of this article lies more in providing a reference framework for analyzing related problems.

For reprints, please include the address of this article: https://kexue.fm/archives/11767

For more details on reprinting, please refer to: Science Space FAQ