English (unofficial) translations of posts at kexue.fm

Why Do We Prefer Isotropy? An Understanding Based on Steepest Descent

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate; please refer to the original post for anything important.

From data whitening preprocessing in the machine learning era to the various normalization methods like BatchNorm, InstanceNorm, LayerNorm, and RMSNorm in the deep learning era, these techniques essentially reflect our preference for "Isotropy." Why do we favor isotropic features? What are the practical benefits? Many answers can be found, such as scale alignment, redundancy reduction, and decorrelation, but most remain at a superficial level.

Recently, while reading the paper "The Affine Divergence: Aligning Activation Updates Beyond Normalisation," I gained a new understanding of this problem from an optimization perspective. I personally find it relatively close to the essence of the matter, so I am writing it down to share and discuss with everyone.

Steepest Descent

We start with the simplest linear layer:
\begin{equation} \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{W} \end{equation}
where \boldsymbol{X} \in \mathbb{R}^{b \times d_{in}} is the current layer's input, \boldsymbol{W} \in \mathbb{R}^{d_{in} \times d_{out}} is the weight matrix, and \boldsymbol{Y} \in \mathbb{R}^{b \times d_{out}} is the output. Let the loss function be \mathcal{L}(\boldsymbol{Y}) = \mathcal{L}(\boldsymbol{X}\boldsymbol{W}); then
\begin{equation} \frac{\partial \mathcal{L}}{\partial\boldsymbol{W}} = \boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}} \end{equation}
Taking gradient descent as an example, the update rule is
\begin{equation} \boldsymbol{W} \quad\leftarrow\quad \boldsymbol{W} - \eta\frac{\partial \mathcal{L}}{\partial\boldsymbol{W}} = \boldsymbol{W} - \eta \boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}} \end{equation}
The basic principle of gradient descent is the well-known fact that "the negative gradient direction is the direction of steepest descent for the loss." This conclusion, however, rests on premises, the most critical of which is that the chosen metric is the Euclidean norm: if we change the norm, the direction of steepest descent also changes. We have already discussed this in articles such as "Muon Sequel: Why We Choose to Try Muon?" and "Steepest Descent on Manifolds: 1. SGD + Hypersphere."
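As a quick sanity check, the identity \frac{\partial \mathcal{L}}{\partial\boldsymbol{W}} = \boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}} can be verified numerically. The sketch below is not from the original post; it uses PyTorch autograd with an arbitrary squared-error loss, and the shapes and the loss are purely illustrative assumptions.

import torch

b, d_in, d_out = 32, 16, 8
X = torch.randn(b, d_in)
W = torch.randn(d_in, d_out, requires_grad=True)
T = torch.randn(b, d_out)            # arbitrary target, only used to build a scalar loss

Y = X @ W
Y.retain_grad()                      # keep dL/dY so we can compare against it
loss = ((Y - T) ** 2).sum()
loss.backward()

print(torch.allclose(W.grad, X.T @ Y.grad))  # True: dL/dW = X^T dL/dY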

Switching Perspectives

This article focuses on another premise that is not so easily noticed: the perspective, or the standpoint.

Assuming we agree that "the negative gradient direction is the direction of steepest descent," the question becomes: the gradient with respect to what? Some readers might answer: with respect to the parameters, of course. That is indeed the standard answer, but it is not necessarily the best one. Parameters are essentially a byproduct of the model; what we actually care about is whether the model's input-output mapping is the one we want.

Therefore, the changes in the input and output features are what we really care about. If we start from the perspective of the features, the conclusion changes. Specifically, when the parameters change from \boldsymbol{W} to \boldsymbol{W} - \eta \boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}, the change in the output features \boldsymbol{Y} is
\begin{equation} \Delta \boldsymbol{Y} = \boldsymbol{X}\left(\boldsymbol{W} - \eta \boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right) - \boldsymbol{X}\boldsymbol{W} = - \eta \boldsymbol{X} \boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}} \end{equation}
According to the principle of steepest descent, from the standpoint of \boldsymbol{Y}, if we want the change in \boldsymbol{Y} to decrease the loss as fast as possible, it should satisfy \Delta \boldsymbol{Y} \propto -\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}. Instead, an extra Gram matrix \boldsymbol{X} \boldsymbol{X}^{\top} appears, which means \boldsymbol{Y} is not moving along its direction of steepest descent.
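To make the computation concrete, here is a minimal numerical sketch (not part of the original post) checking that one SGD step on \boldsymbol{W} changes the output by exactly -\eta \boldsymbol{X}\boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}; the matrix G below simply stands in for the output gradient, and the sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
b, d_in, d_out, eta = 32, 16, 8, 0.1
X = rng.standard_normal((b, d_in))
W = rng.standard_normal((d_in, d_out))
G = rng.standard_normal((b, d_out))      # stands in for dL/dY

W_new = W - eta * X.T @ G                # one SGD step on the parameters
delta_Y = X @ W_new - X @ W              # resulting change in the output features
print(np.allclose(delta_Y, -eta * X @ X.T @ G))  # True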

Isotropy

A naive idea is: it would be great if \boldsymbol{X} \boldsymbol{X}^{\top} were exactly equal to the identity matrix (or a multiple of it). We know that \boldsymbol{X} \in \mathbb{R}^{b \times d_{in}}. If b \leq d_{in}, then \boldsymbol{X} \boldsymbol{X}^{\top} = \boldsymbol{I} means that these b row vectors are orthonormal. In practice, however, we usually have b > d_{in}, in which case \boldsymbol{X} \boldsymbol{X}^{\top} = \boldsymbol{I} cannot hold exactly, since the Gram matrix has rank at most d_{in} < b.

In this case, the best we can hope for is that these b vectors are distributed as uniformly as possible on the unit hypersphere, so that \boldsymbol{X} \boldsymbol{X}^{\top} \approx \boldsymbol{I}. This is where the preference for isotropy comes from. In other words:

If the input features are isotropic, then steepest descent on the parameters simultaneously approximates steepest descent on the features. Acting on both levels at once improves the model's learning efficiency.

One can also show that if a random vector follows a d_{in}-dimensional standard normal distribution, then in addition to being isotropic, its norm is highly concentrated around \sqrt{d_{in}}; that is, it approximately lies on a hypersphere of radius \sqrt{d_{in}}. Conversely, if we normalize the input \boldsymbol{X} to have zero mean and unit covariance, then \boldsymbol{X} \boldsymbol{X}^{\top} \approx d_{in}\boldsymbol{I} also holds approximately; this operation is exactly whitening.
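A small simulation (an illustrative sketch with arbitrary sizes, not from the original post) shows both effects: rows drawn from a standard normal have norms concentrated near \sqrt{d_{in}}, and their Gram matrix is close to d_{in}\boldsymbol{I} in the entrywise sense, with off-diagonal entries on the order of \sqrt{d_{in}}, much smaller than the diagonal entries of order d_{in}.

import numpy as np

rng = np.random.default_rng(0)
b, d_in = 64, 1024
X = rng.standard_normal((b, d_in))

norms = np.linalg.norm(X, axis=1)
print(norms.mean(), norms.std(), np.sqrt(d_in))   # mean ~ 32, std ~ 0.7

gram = X @ X.T
diag = np.diag(gram)
off = gram - np.diag(diag)
print(diag.mean() / d_in)                 # ~ 1: diagonal entries are ~ d_in
print(np.abs(off).mean() / d_in)          # ~ 0.025: off-diagonal entries are comparatively tiny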

Normalization Layers

In addition to whitening the data beforehand, we can also introduce normalization operations inside the model so that intermediate features approximately satisfy the desired properties as well. More specifically, we look for a d_{in} \times d_{in} transformation matrix \boldsymbol{A} such that the transformed features \boldsymbol{X}\boldsymbol{A} satisfy (\boldsymbol{X}\boldsymbol{A})(\boldsymbol{X}\boldsymbol{A})^{\top} \approx \boldsymbol{I} as closely as possible, i.e.,
\begin{equation} \min_{\boldsymbol{A}} \Vert \boldsymbol{X}\boldsymbol{A}\boldsymbol{A}^{\top}\boldsymbol{X}^{\top} - \boldsymbol{I}\Vert_F \end{equation}
The solution to this problem can be expressed using the pseudo-inverse:
\begin{equation} \boldsymbol{A}\boldsymbol{A}^{\top} = \boldsymbol{X}^{\dagger}(\boldsymbol{X}^{\top})^{\dagger} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1} \end{equation}
Here we assume that \boldsymbol{X}^{\top}\boldsymbol{X} is invertible. From the above equation we obtain a feasible solution \boldsymbol{A} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1/2}, and the corresponding transformation is
\begin{equation} \boldsymbol{X}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1/2} \end{equation}
This is precisely whitening without centering. Since computing (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1/2} is expensive, a diagonal approximation can be used instead, which amounts to standardizing each dimension separately; depending on the granularity, this corresponds to operations such as BatchNorm and InstanceNorm. If we care more about the "hypersphere," we can instead standardize the magnitude of each sample separately, which corresponds to operations such as LayerNorm and RMSNorm.
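The sketch below illustrates whitening without centering: it computes \boldsymbol{A} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1/2} via an eigendecomposition and checks that the transformed features have an identity (uncentered) second moment. The data sizes, the way correlated features are generated, and the diagonal variant shown for comparison are all illustrative assumptions, not the post's implementation.

import numpy as np

rng = np.random.default_rng(0)
b, d_in = 256, 16
# Correlated, anisotropic features: a random linear mix of Gaussian data.
X = rng.standard_normal((b, d_in)) @ rng.standard_normal((d_in, d_in))

S = X.T @ X                                       # assumed invertible (full column rank)
eigval, eigvec = np.linalg.eigh(S)
A = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T   # A = (X^T X)^{-1/2}

XA = X @ A
print(np.allclose(XA.T @ XA, np.eye(d_in)))       # True: unit (uncentered) second moment

# Cheaper diagonal approximation: rescale each dimension only (BatchNorm-style,
# without centering); it equalizes scales but does not remove cross-correlations.
A_diag = np.diag(np.diag(S) ** -0.5)
XD = X @ A_diag
print(np.allclose(np.diag(XD.T @ XD), np.ones(d_in)))  # True on the diagonal only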

Beyond SGD

Interestingly, the conclusion that "if the input features are isotropic, steepest descent on the parameters simultaneously approximates steepest descent on the features" applies not only to SGD but also to Muon. Consider Muon without momentum; the update rule is
\begin{equation} \boldsymbol{W} \quad\leftarrow\quad \boldsymbol{W} - \eta\mathop{\mathrm{msign}}\left(\frac{\partial \mathcal{L}}{\partial\boldsymbol{W}}\right) = \boldsymbol{W} - \eta \mathop{\mathrm{msign}}\left(\boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right) \end{equation}
Using \mathop{\mathrm{msign}}(\boldsymbol{M}) = \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^{-1/2}, we have
\begin{equation} \Delta \boldsymbol{Y} = - \eta \boldsymbol{X} \mathop{\mathrm{msign}}\left(\boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right) = - \eta \boldsymbol{X} \boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}} \left(\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}^{\top}\boldsymbol{X}\boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right)^{-1/2} \end{equation}
From this it can be seen that if \boldsymbol{X}\boldsymbol{X}^{\top} \approx \boldsymbol{I}, then
\begin{equation} \Delta \boldsymbol{Y} \approx - \eta \frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}} \left(\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right)^{-1/2} = -\eta\mathop{\mathrm{msign}}\left(\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right) \end{equation}
We know that Muon is steepest descent under the spectral norm. The above equation therefore implies that if the input features are isotropic, spectral-norm steepest descent on the parameters simultaneously approximates spectral-norm steepest descent on the features. Quite elegant! For other optimizers such as SignSGD, however, a similar property does not hold, and this difference may be one of the reasons behind Muon's superiority.
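As a final check, the sketch below (not code from the paper or the post) implements \mathop{\mathrm{msign}} via an SVD and verifies the identity in the exactly solvable case b \leq d_{in}, where \boldsymbol{X}\boldsymbol{X}^{\top} = \boldsymbol{I} can hold exactly; with isotropic features and b > d_{in}, the relation holds only approximately. The sizes and the stand-in gradient G are assumptions for illustration.

import numpy as np

def msign(M):
    # msign(M) = M (M^T M)^{-1/2}; for full-rank M this equals U V^T from the SVD.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
b, d_in, d_out, eta = 8, 32, 4, 0.1
Q, _ = np.linalg.qr(rng.standard_normal((d_in, b)))
X = Q.T                                  # b x d_in with orthonormal rows: X X^T = I
G = rng.standard_normal((b, d_out))      # stands in for dL/dY

delta_Y = -eta * X @ msign(X.T @ G)      # feature change under one Muon step on W
print(np.allclose(delta_Y, -eta * msign(G)))  # True when X X^T = I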

Summary

In this article, we discussed the question: when is steepest descent at the parameter level also (approximately) steepest descent at the feature level? The answer, as stated in the title, is "isotropy." This gives an explanation for why we favor isotropy: it synchronizes steepest descent at the two levels, thereby improving training efficiency.

When reposting, please include the original address of this article: https://kexue.fm/archives/11549

For more details on reposting, please refer to: Scientific Spaces FAQ