English (unofficial) translations of posts at kexue.fm

Finding Alternatives to Normalization via Gradient Approximation

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

I wonder if everyone has noticed the recent paper "Transformers without Normalization"? It attempts to replace the Normalization layers in Transformer models with an element-wise operation called DyT, aiming to improve speed while maintaining performance. Work on fundamental architecture has an appeal of its own, and with the names of Kaiming He and Yann LeCun attached, the paper attracted significant attention upon release, drawing both praise and criticism.

Coincidentally, a paper from last week, "The Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions", interprets DyT from the perspective of gradient analysis and differential equations and proposes new alternatives. I find this perspective quite fundamental, so I decided to study it and share it here.

Preface

DyT stands for Dynamic Tanh. It replaces the Normalization layer with the following operation: \begin{equation} \operatorname{DyT}(\boldsymbol{x}) = \boldsymbol{\gamma} \odot \tanh(\alpha \boldsymbol{x}) + \boldsymbol{\beta} \end{equation} where \alpha, \boldsymbol{\beta}, \boldsymbol{\gamma} are learnable parameters. Since \boldsymbol{\beta} and \boldsymbol{\gamma} are already present in standard Normalization layers, the key here is replacing the Normalize operation with \tanh(\alpha \boldsymbol{x}). \tanh is an element-wise operation, which eliminates the need to calculate statistics like mean and variance.
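
To make this concrete, here is a minimal numpy sketch (my own, not code from the paper) placing the element-wise DyT operation next to the RMS Norm it is meant to replace; the parameter values for \boldsymbol{\gamma}, \boldsymbol{\beta}, \alpha are illustrative choices, not trained values.

```python
import numpy as np

def rms_norm(x, gamma, beta, eps=1e-6):
    # y = gamma * x / ||x||_RMS + beta, with ||x||_RMS = sqrt(mean(x^2))
    rms = np.sqrt(np.mean(x**2) + eps)
    return gamma * (x / rms) + beta

def dyt(x, gamma, beta, alpha):
    # DyT(x) = gamma * tanh(alpha * x) + beta: purely element-wise, so no
    # mean/variance statistics over the d components are required
    return gamma * np.tanh(alpha * x) + beta

d = 8
x = np.random.randn(d)
gamma, beta = np.ones(d), np.zeros(d)   # illustrative initial values
print(rms_norm(x, gamma, beta))
print(dyt(x, gamma, beta, alpha=0.5))   # alpha = 0.5 is just an example value
```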

Regarding DyT, I previously shared some views on Zhihu in the discussion "How to evaluate Meta’s new paper Transformers without Normalization?". Briefly, I am not particularly optimistic. The reason is that Normalization stabilizes the model's forward propagation unconditionally, thereby leaving more degrees of freedom and possibilities for other aspects of the model (such as performance). Therefore, I do not believe a simplified, universal operation can achieve better results than Normalization (No Free Lunch).

In fact, as early as 2021, in "A Brief Discussion on Initialization, Parameterization, and Standardization of Transformers", we discussed the topic of removing Normalization; related works include SkipInit, ReZero, and Fixup. At that time, I tried several of these schemes and found that even when they matched Normalization in some respects, they fell short in others (for example, pre-training performance was acceptable, but fine-tuning performance was poor), so I did not pursue the direction further.

Therefore, I now view such works only as explorations of the limits of simplification. Much like "nGPT: Normalized Transformer with Representation Learning on the Hypersphere", which adds Normalization almost everywhere it can be added, these represent extreme explorations in a specific direction.

Gradient Calculation

Of course, being skeptical does not stop us from learning and analyzing. To find a replacement or approximation for Normalization, the most direct approach is to start from the gradient. After all, deep learning essentially consists of forward and backward propagation, and backward propagation is just gradient computation, so the gradient often plays a fundamental role.

Next, we consider only RMS Norm. Its key operation is: \begin{equation} \boldsymbol{y} = \frac{\boldsymbol{x}}{\lVert\boldsymbol{x}\rVert_{RMS}} = \sqrt{d}\times \frac{\boldsymbol{x}}{\lVert\boldsymbol{x}\rVert} \label{eq:rms-norm} \end{equation} where \boldsymbol{x}\in\mathbb{R}^d, and \begin{equation} \lVert\boldsymbol{x}\rVert_{RMS} = \frac{\lVert\boldsymbol{x}\rVert}{\sqrt{d}},\qquad \lVert\boldsymbol{x}\rVert = \sqrt{\boldsymbol{x}^2} = \sqrt{\sum_{i=1}^d x_i^2} \end{equation} Finding the gradient of \boldsymbol{x} / \lVert\boldsymbol{x}\rVert_{RMS} is equivalent to finding the gradient of \boldsymbol{x} / \lVert\boldsymbol{x}\rVert. We can calculate it as follows: \begin{equation} \frac{\boldsymbol{x}+\Delta\boldsymbol{x}}{\lVert\boldsymbol{x}+\Delta\boldsymbol{x}\rVert} = \frac{\boldsymbol{x}}{\lVert\boldsymbol{x}+\Delta\boldsymbol{x}\rVert} + \frac{\Delta\boldsymbol{x}}{\lVert\boldsymbol{x}+\Delta\boldsymbol{x}\rVert} \approx \frac{\boldsymbol{x}}{\lVert\boldsymbol{x}+\Delta\boldsymbol{x}\rVert} + \frac{\Delta\boldsymbol{x}}{\lVert\boldsymbol{x}\rVert} \label{eq:exp-1} \end{equation} The more complex part is expanding \lVert\boldsymbol{x}+\Delta\boldsymbol{x}\rVert = \sqrt{(\boldsymbol{x}+\Delta\boldsymbol{x})^2}: \begin{equation} \begin{aligned} &\,\sqrt{(\boldsymbol{x}+\Delta\boldsymbol{x})^2} \\ \approx&\, \sqrt{\lVert\boldsymbol{x}\rVert^2+2\boldsymbol{x}\cdot\Delta\boldsymbol{x}} \\ =&\, \lVert\boldsymbol{x}\rVert\sqrt{1+2\boldsymbol{x}\cdot\Delta\boldsymbol{x}/\lVert\boldsymbol{x}\rVert^2} \\ \approx&\, \lVert\boldsymbol{x}\rVert (1+\boldsymbol{x}\cdot\Delta\boldsymbol{x}/\lVert\boldsymbol{x}\rVert^2) \end{aligned} \quad \Rightarrow \quad \begin{aligned} \frac{\boldsymbol{x}}{\lVert\boldsymbol{x}+\Delta\boldsymbol{x}\rVert} \approx&\, \frac{\boldsymbol{x}}{\lVert\boldsymbol{x}\rVert}(1-\boldsymbol{x}\cdot\Delta\boldsymbol{x}/\lVert\boldsymbol{x}\rVert^2) \end{aligned} \end{equation} Substituting back into Equation [eq:exp-1] gives: \begin{equation} \frac{\boldsymbol{x}+\Delta\boldsymbol{x}}{\lVert\boldsymbol{x}+\Delta\boldsymbol{x}\rVert} - \frac{\boldsymbol{x}}{\lVert\boldsymbol{x}\rVert} \approx \frac{\Delta\boldsymbol{x}}{\lVert\boldsymbol{x}\rVert} - \frac{(\boldsymbol{x}\cdot\Delta\boldsymbol{x})\boldsymbol{x}}{\lVert\boldsymbol{x}\rVert^3}\quad\Rightarrow\quad\nabla_{\boldsymbol{x}} \frac{\boldsymbol{x}}{\lVert\boldsymbol{x}\rVert} = \frac{\boldsymbol{I}}{\lVert\boldsymbol{x}\rVert} - \frac{\boldsymbol{x}\boldsymbol{x}^{\top}}{\lVert\boldsymbol{x}\rVert^3} \end{equation} Finally, substituting back into Equation [eq:rms-norm]: \begin{equation} \nabla_{\boldsymbol{x}} \boldsymbol{y} = \sqrt{d}\left(\frac{\boldsymbol{I}}{\lVert\boldsymbol{x}\rVert} - \frac{\boldsymbol{x}\boldsymbol{x}^{\top}}{\lVert\boldsymbol{x}\rVert^3}\right) = \frac{1}{\lVert\boldsymbol{x}\rVert_{RMS}}\left(\boldsymbol{I} - \frac{\boldsymbol{y}\boldsymbol{y}^{\top}}{d}\right) \label{eq:rms-norm-grad} \end{equation}
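
The closed-form Jacobian above is easy to verify numerically. The following sketch (my own, assuming nothing beyond the formulas in this section) compares it against a finite-difference Jacobian of \boldsymbol{x} \mapsto \boldsymbol{x}/\lVert\boldsymbol{x}\rVert_{RMS}.

```python
import numpy as np

def rms_normalize(x):
    return x / np.sqrt(np.mean(x**2))

def jacobian_closed_form(x):
    # (1/||x||_RMS) * (I - y y^T / d), with y = x / ||x||_RMS
    d = len(x)
    y = rms_normalize(x)
    rms = np.sqrt(np.mean(x**2))
    return (np.eye(d) - np.outer(y, y) / d) / rms

def jacobian_finite_diff(x, h=1e-6):
    # central differences; column j holds the derivative w.r.t. x_j
    d = len(x)
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        J[:, j] = (rms_normalize(x + e) - rms_normalize(x - e)) / (2 * h)
    return J

x = np.random.randn(10)
print(np.max(np.abs(jacobian_closed_form(x) - jacobian_finite_diff(x))))  # tiny
```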

DyT Emerges!

Note that \boldsymbol{x} and \boldsymbol{y} are both vectors, so \nabla_{\boldsymbol{x}} \boldsymbol{y} is a matrix (the Jacobian matrix). Now we consider finding an element-wise approximation for RMS Norm, meaning each component is operated on independently: \begin{equation} f(\boldsymbol{x}) = [f(x_1), f(x_2), \cdots, f(x_d)] \end{equation} This independence implies that its Jacobian matrix must be diagonal! We want this approximation to preserve the gradient of RMS Norm as much as possible, so we consider keeping only the diagonal part of Equation [eq:rms-norm-grad]: \begin{equation} \frac{dy_i}{dx_i} = \frac{1}{\lVert\boldsymbol{x}\rVert_{RMS}}\left(1 - \frac{y_i^2}{d}\right) \label{eq:ode-1} \end{equation} If we further assume that \rho = \lVert\boldsymbol{x}\rVert_{RMS} is a constant, we can directly solve the above differential equation to obtain: \begin{equation} y_i = \sqrt{d}\tanh\left(\frac{x_i}{\rho\sqrt{d}}\right) \end{equation} Thus, we obtain the "T" (\tanh) in DyT, where the initial condition y_i(0)=0 was chosen for the solution.
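
The ODE step can also be checked numerically. The sketch below (my own, with d and \rho chosen arbitrarily) confirms that y(x) = \sqrt{d}\tanh(x/(\rho\sqrt{d})) with y(0)=0 satisfies \frac{dy}{dx} = \frac{1}{\rho}(1 - y^2/d).

```python
import numpy as np

d, rho = 64, 1.0                      # rho stands in for the (assumed constant) ||x||_RMS
xs = np.linspace(-5.0, 5.0, 1001)
ys = np.sqrt(d) * np.tanh(xs / (rho * np.sqrt(d)))

lhs = np.gradient(ys, xs)             # numerical dy/dx
rhs = (1 - ys**2 / d) / rho           # right-hand side of the ODE
print(np.max(np.abs(lhs - rhs)))      # small, limited by finite-difference error
```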

DyT is equivalent to absorbing the leading \sqrt{d} into the \boldsymbol{\gamma} parameter and treating the \frac{1}{\rho\sqrt{d}} inside the parentheses as the training parameter \alpha, which alleviates the restriction imposed by the assumption that "\rho = \lVert\boldsymbol{x}\rVert_{RMS} is constant." However, in my view, explicitly retaining \sqrt{d} might be more valuable, as long as the \frac{1}{\rho} part is treated as a trainable parameter.
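
To spell out this reparameterization, here is a tiny illustrative check (my own): with \alpha = 1/(\rho\sqrt{d}) and the \sqrt{d} absorbed into \boldsymbol{\gamma}, the DyT form and the explicit-\sqrt{d} form coincide exactly; the difference only matters for which quantity is treated as trainable.

```python
import numpy as np

d = 16
x = np.random.randn(d)
gamma = np.random.randn(d)
rho = 1.3                                   # stands in for ||x||_RMS (illustrative)

# Explicit-sqrt(d) form: keep sqrt(d) and train 1/rho.
y1 = gamma * np.sqrt(d) * np.tanh(x / (rho * np.sqrt(d)))

# DyT form: sqrt(d) absorbed into gamma, alpha = 1/(rho * sqrt(d)).
alpha = 1.0 / (rho * np.sqrt(d))
y2 = (gamma * np.sqrt(d)) * np.tanh(alpha * x)

print(np.allclose(y1, y2))                  # True: the two forms coincide
```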

DyISRU

I wonder if anyone noticed that for RMS Norm, we always have y_i = x_i / \lVert\boldsymbol{x}\rVert_{RMS}. Therefore, we can replace \lVert\boldsymbol{x}\rVert_{RMS} in Equation [eq:ode-1] with x_i/y_i, resulting in: \begin{equation} \frac{dy_i}{dx_i} = \frac{y_i}{x_i}\left(1 - \frac{y_i^2}{d}\right) \label{eq:ode-2} \end{equation} This is an equation involving only x_i and y_i, eliminating the need for an approximate treatment of \lVert\boldsymbol{x}\rVert_{RMS}. Solving this equation yields: \begin{equation} y_i = \frac{\sqrt{d}x_i}{\sqrt{x_i^2 + C}} \end{equation} where C is an arbitrary constant. This form is known as ISRU (Inverse Square Root Unit; we have also called it SoftSign before), originating from the paper "Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs)". If C is treated as a trainable parameter, it can be called DyISRU (Dynamic ISRU) by analogy with DyT.
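
As with the tanh case, one can numerically check that this ISRU form satisfies Equation [eq:ode-2]. The sketch below (my own, with d and C fixed to arbitrary values; in DyISRU, C would be trainable) avoids x = 0, where the y/x factor is a 0/0 limit numerically.

```python
import numpy as np

def dyisru(x, C, d):
    # y = sqrt(d) * x / sqrt(x^2 + C), applied element-wise
    return np.sqrt(d) * x / np.sqrt(x**2 + C)

d, C = 64, 2.0
xs = np.linspace(0.1, 5.0, 1001)      # avoid x = 0
ys = dyisru(xs, C, d)

lhs = np.gradient(ys, xs)             # numerical dy/dx
rhs = (ys / xs) * (1 - ys**2 / d)     # right-hand side of eq. [eq:ode-2]
print(np.max(np.abs(lhs - rhs)))      # small, limited by finite-difference error
```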

Tracing the path from the gradient [eq:rms-norm-grad] to Equation [eq:ode-1] and then to [eq:ode-2], DyISRU is the best result we can achieve with an element-wise function, since no approximation beyond the diagonal assumption was introduced. Formally, DyISRU is actually more intuitive than DyT, because \lVert\boldsymbol{x}\rVert_{RMS}^2 is just \mathbb{E}[x_i^2]. Since we are seeking an element-wise operation, we are forced to replace \mathbb{E}[x_i^2] with x_i^2; adding C and multiplying by \sqrt{d} then serve to smooth this crude replacement: \begin{equation} \frac{x_i}{\sqrt{\textcolor{red}{\frac{1}{d}\sum\limits_{j=1}^d x_j^2}}} \quad \to \quad \frac{x_i}{\sqrt{\textcolor{green}{x_i^2}}} \quad \to \quad \frac{\textcolor{orange}{\sqrt{d}} x_i}{\sqrt{\textcolor{green}{x_i^2} + \textcolor{orange}{C}}} \end{equation}
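
Purely as an illustration (my own sketch, with \rho and C hand-picked rather than trained), one can compare how closely the two element-wise candidates track x_i/\lVert\boldsymbol{x}\rVert_{RMS} on a Gaussian input.

```python
import numpy as np

np.random.seed(0)
d = 128
x = np.random.randn(d)

rms = np.sqrt(np.mean(x**2))
target = x / rms                              # RMS Norm (gamma = 1, beta = 0)

# tanh solution with rho fixed to the true RMS of this particular x
y_dyt = np.sqrt(d) * np.tanh(x / (rms * np.sqrt(d)))
# ISRU solution with C set to (d - 1) * rms^2, i.e. roughly ||x||^2 - x_i^2
y_dyisru = np.sqrt(d) * x / np.sqrt(x**2 + (d - 1) * rms**2)

print(np.max(np.abs(y_dyt - target)))         # both deviations are small relative
print(np.max(np.abs(y_dyisru - target)))      # to the scale of the target
```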

Summary

This article analyzes, from the perspective of gradient approximation, which element-wise activation functions can (to some extent) replace Normalization layers. From this we can derive DyT, as well as a new alternative, DyISRU.

When reposting, please include the original address of this article:
https://kexue.fm/archives/10831

For more detailed reposting matters, please refer to:
"Scientific Space FAQ"