In "What are the Difficulties in Training a 1000-layer Transformer?", we introduced the DeepNet technology proposed by Microsoft, which enables the training of a 1000-layer Transformer. Regarding DeepNet, readers generally have two reactions: one is to be amazed and give it a thumbs up, while the other is to feel it is just "old wine in new bottles" and thus uninteresting. Readers with the latter reaction often feel this way because the two improvements proposed by DeepNet—increasing the weight of the identity path and reducing the initialization of the residual branches—are quite commonplace, and similar conclusions have appeared in other works, making it hard to find anything fresh.
Admittedly, judged by its conclusions alone, DeepNet is not particularly exciting. However, I believe that the analysis behind DeepNet is far more important than its conclusions. Its value lies in providing a concise and effective method for analyzing gradient magnitudes, which can be applied to many related questions. For instance, the question discussed in this article, "Why do we need residuals?", can be given an answer that comes closer to the essence of the matter.
Incremental Explosion
Why do we need residuals? The answer is that with residuals, it is easier to train deep models, where "deep" could mean hundreds, thousands, or even tens of thousands of layers. So the question becomes: why is it difficult to train deep models without residuals?
Many readers’ answer would likely be gradient vanishing or gradient explosion. These are indeed two very important issues. However, with specific initialization methods and Normalization techniques, we can already make the gradients of ordinary feedforward neural networks very stable. Yet, even so, training deep feedforward neural networks remains difficult. This indicates that the reason is not just gradient vanishing/explosion, but another problem, which is the "incremental explosion" we discussed in "What are the Difficulties in Training a 1000-layer Transformer?".
Understanding incremental explosion is not difficult. Suppose the loss function is \mathcal{L}(\boldsymbol{\theta}), where \boldsymbol{\theta} represents its parameters. When the parameters change from \boldsymbol{\theta} to \boldsymbol{\theta}+\Delta\boldsymbol{\theta}, a first-order expansion gives:
\Delta\mathcal{L} = \mathcal{L}(\boldsymbol{\theta}+\Delta\boldsymbol{\theta}) - \mathcal{L}(\boldsymbol{\theta}) \approx \langle\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}),\Delta\boldsymbol{\theta}\rangle
For SGD, we have \Delta\boldsymbol{\theta}=-\eta \nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}), so \Delta\mathcal{L} \approx -\eta\Vert\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})\Vert^2. Suppose the model has N layers, and the average number of parameters per layer is K. If the gradient vanishing/explosion problem is solved, we can assume that the gradient of each parameter is of the order \mathcal{O}(1), so the squared gradient norm is a sum of NK terms of order 1 and \Delta\mathcal{L}=\mathcal{O}(\eta NK). Therefore, the change in the loss at each step is proportional to the model depth N (width is not within the scope of this discussion). The deeper the model, the larger this change, which means that in the initial stage the model is more likely to fall into a poor local optimum, leading to training stagnation or even collapse. This is the "incremental explosion" problem.
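As a rough numerical illustration of this scaling, the sketch below uses a toy quadratic loss in which every parameter has a gradient of exactly 1. The toy loss and the names `sgd_loss_change`, `n_layers`, and `params_per_layer` are assumptions made only for this illustration and do not come from DeepNet.

```python
def sgd_loss_change(n_layers, params_per_layer, lr=1e-3):
    """Loss change after one SGD step on the toy loss L = 0.5 * sum(theta_i^2).

    Every parameter starts at 1.0, so every gradient component equals 1,
    mimicking the "each gradient is O(1)" assumption in the text.
    """
    theta = [1.0] * (n_layers * params_per_layer)
    loss_before = 0.5 * sum(t * t for t in theta)
    grad = theta[:]  # dL/dtheta_i = theta_i for this toy loss
    theta_new = [t - lr * g for t, g in zip(theta, grad)]
    loss_after = 0.5 * sum(t * t for t in theta_new)
    return loss_after - loss_before
```

Calling `sgd_loss_change(10, 16)` and `sgd_loss_change(100, 16)` should give loss changes that differ by a factor of about 10, matching the first-order estimate -\eta NK.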
Treating the Symptoms
Simply put, "incremental explosion" means that as the number of layers increases, a tiny change in parameters leads to a large change in the loss function. This is particularly unfavorable for model training, especially in the initial stages. A direct coping technique for this is Warmup, where a very small learning rate is used in the initial stage and then gradually increased to avoid learning too fast initially. Once the model has safely passed the "danger period" of the initial stage, normal training can proceed.
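For concreteness, here is a minimal sketch of one common form of Warmup, a linear ramp. The text does not specify the shape of the schedule, so the linear form and the names `warmup_lr`, `base_lr`, and `warmup_steps` are assumptions for illustration.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup: ramp the learning rate from a small value up to base_lr."""
    if warmup_steps <= 0:
        return base_lr  # no warmup requested
    # step is 0-indexed; the rate reaches base_lr at step warmup_steps - 1
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```

With `warmup_steps=10`, the first step uses one tenth of `base_lr`, and from the tenth step onward the full rate is used.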
However, although Warmup helps to some extent, it is actually "treating the symptoms rather than the root cause." This is because "a tiny change in parameters leading to a large change in the loss function" implies that the model itself fluctuates sharply. In more technical terms, the model's loss landscape is far from smooth. This is not a property a good model should possess. Therefore, we should solve this problem by modifying the model rather than through "superficial" methods like lowering the learning rate.
Modifying the model means adjusting the model structure or initialization method to naturally offset the impact of the number of layers N on the update amount. According to the previous results \Delta\mathcal{L} \approx -\eta\Vert\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})\Vert^2 and \Delta\mathcal{L}=\mathcal{O}(\eta NK), to offset the impact of the number of layers, we must make each component of the gradient \nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}) of the order \mathcal{O}(1/\sqrt{N}), so that the squared gradient norm no longer grows with N. In other words, the gradient of each parameter must decrease as the number of layers increases.
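To make the counting explicit, the same estimate can be written out as follows, assuming all NK gradient components g_i share the same order of magnitude (the symbol g_i is introduced here only for illustration):

```latex
\Vert\nabla_{\boldsymbol{\theta}}\mathcal{L}\Vert^2
= \sum_{i=1}^{NK} g_i^2
= NK\cdot\mathcal{O}\!\left(\frac{1}{N}\right)
= \mathcal{O}(K)
\quad\Longrightarrow\quad
\Delta\mathcal{L} \approx -\eta\Vert\nabla_{\boldsymbol{\theta}}\mathcal{L}\Vert^2 = \mathcal{O}(\eta K)
```

Under this assumption the loss change per step depends only on the learning rate and the width, not on the depth.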
Stable Propagation
If the goal were only to shrink the gradient, it would be simple: just reduce the initialization variance as much as possible. But in reality, while shrinking the gradient we must also keep forward propagation stable, because the stability of forward propagation is a kind of prior knowledge we have about the task and gives the model a better starting point. In "A Brief Discussion on the Initialization, Parameterization, and Standardization of Transformers", we discussed that the stability of forward propagation can be measured by the second moment. For a simple linear layer:
\boldsymbol{y} = \boldsymbol{x}\boldsymbol{W}, \quad \boldsymbol{x}\in\mathbb{R}^n, \boldsymbol{W}\in\mathbb{R}^{n\times m}
We already know that to make the second moment of \boldsymbol{y} equal to the second moment of \boldsymbol{x}, we need an initialization method with a mean of zero and a variance of 1/n. If an activation function is considered, a constant factor is added; for example, for the \text{relu} activation function, the variance is changed to 2/n. For backpropagation, we have:
\frac{\partial\mathcal{L}}{\partial \boldsymbol{x}} = \frac{\partial\mathcal{L}}{\partial \boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial \boldsymbol{x}} = \frac{\partial\mathcal{L}}{\partial \boldsymbol{y}}\boldsymbol{W}^{\top}
As we can see, backpropagation is exactly the opposite: the sum now runs over the m output dimensions. To stabilize the second moment of backpropagation, an initialization method with a mean of zero and a variance of 1/m is required. Xavier initialization compromises by averaging the two dimensions n and m, which gives a variance of 2/(n+m). For more details, refer to "Thinking on the Dimension Averaging Strategy for Non-square Matrices in Initialization Methods".
In other words, if we want to stabilize forward propagation, the initialization variance is 1/n, and the second moment of backpropagation becomes m/n times the original. Since m and n are pre-selected hyperparameters that have no necessary connection with the number of layers, we cannot use them to achieve the requirement of reducing the gradient to 1/\sqrt{N} times the original. This means that for a deep feedforward neural network without residuals:
\phi_l(\phi_{l-1}(\phi_{l-2}(\cdots\phi_1(\boldsymbol{x}\boldsymbol{W}_1 + \boldsymbol{b}_1)\cdots)\boldsymbol{W}_{l-1} + \boldsymbol{b}_{l-1})\boldsymbol{W}_l + \boldsymbol{b}_l)
As long as its forward propagation is stable, the backpropagation is also fixed and cannot make the gradient dependent on the number of layers. Therefore, at most, we can solve the gradient vanishing and explosion problems of deep feedforward neural networks, but we cannot solve the "incremental explosion" problem mentioned at the beginning of this article. Consequently, deep feedforward neural networks are inevitably difficult to train.
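The second-moment claims for a single linear layer can be checked with a small simulation. The sketch below is only an illustration under assumptions: Gaussian weights, Gaussian inputs and upstream gradients, and the hypothetical helper names `second_moment` and `moment_ratios`. It initializes W with variance 1/n and estimates how the second moment changes in the forward and backward directions.

```python
import random


def second_moment(v):
    """Mean of squared entries of a vector."""
    return sum(t * t for t in v) / len(v)


def moment_ratios(n, m, trials=20, seed=0):
    """Average forward and backward second-moment ratios for y = x W.

    W is n x m with zero mean and variance 1/n (the forward-stable choice).
    Returns (forward ratio, backward ratio); the text predicts about (1, m/n).
    """
    rng = random.Random(seed)
    std = (1.0 / n) ** 0.5
    fwd, bwd = 0.0, 0.0
    for _ in range(trials):
        W = [[rng.gauss(0.0, std) for _ in range(m)] for _ in range(n)]
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # layer input
        gy = [rng.gauss(0.0, 1.0) for _ in range(m)]  # stand-in for dL/dy
        y = [sum(x[i] * W[i][j] for i in range(n)) for j in range(m)]   # x W
        gx = [sum(gy[j] * W[i][j] for j in range(m)) for i in range(n)]  # gy W^T
        fwd += second_moment(y) / second_moment(x)
        bwd += second_moment(gx) / second_moment(gy)
    return fwd / trials, bwd / trials
```

For example, `moment_ratios(100, 200)` should return a forward ratio close to 1 and a backward ratio close to m/n = 2. Neither number involves the depth N, which is the point of the argument above.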
The Emergence of Residuals
This is where residuals come into play! Without loss of generality, assuming the input and output dimensions are equal, we consider:
\boldsymbol{y} = \boldsymbol{x} + \varepsilon \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta})
Obviously, as long as \varepsilon is small enough, forward propagation is necessarily stable; and:
\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \boldsymbol{I} + \varepsilon\frac{\partial \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta})}{\partial \boldsymbol{x}}
So it can also be seen that as long as \varepsilon is small enough, backpropagation is also stable. As for the gradient of the parameters:
\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}} = \frac{\partial \mathcal{L}}{\partial \boldsymbol{y}}\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{\theta}} = \varepsilon\frac{\partial \mathcal{L}}{\partial \boldsymbol{y}}\frac{\partial \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}
This shows that we can control \varepsilon to achieve layer-dependent gradient scaling! For example, if we want to scale the gradient to 1/\sqrt{N}, we can simply let \varepsilon=1/\sqrt{N}.
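The linear dependence of the parameter gradient on \varepsilon can be checked numerically on a scalar toy layer. In the sketch below, the choice f(x; \theta) = tanh(\theta x) and the names `residual_layer` and `param_grad` are assumptions for illustration only; the gradient is estimated by central finite differences with \partial\mathcal{L}/\partial y taken to be 1.

```python
import math


def residual_layer(x, theta, eps):
    """Scalar residual layer y = x + eps * f(x; theta), with f = tanh(theta * x)."""
    return x + eps * math.tanh(theta * x)


def param_grad(x, theta, eps, h=1e-6):
    """Central finite-difference estimate of dy/dtheta."""
    y_plus = residual_layer(x, theta + h, eps)
    y_minus = residual_layer(x, theta - h, eps)
    return (y_plus - y_minus) / (2 * h)
```

With N = 100 and \varepsilon = 1/\sqrt{N} = 0.1, `param_grad(0.7, 0.3, 0.1)` should be about one tenth of `param_grad(0.7, 0.3, 1.0)`, while the identity path leaves the input contribution untouched.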
With this result, we can answer why we need residuals:
Because the residual structure is a design that can simultaneously stabilize forward and backward propagation and scale parameter gradients to solve incremental explosion, thereby helping us train deeper models.
Small Enough
We just said "\varepsilon is small enough" twice, but how small is small enough? Is \varepsilon=1/\sqrt{N} sufficient?
Suppose the model is one-dimensional, so that \frac{\partial y}{\partial x} = 1 + \varepsilon\frac{\partial f}{\partial x}. Generally, we assume \frac{\partial f}{\partial x} is \mathcal{O}(1), so we can approximately use \frac{\partial y}{\partial x}=1+\varepsilon for magnitude estimation. After propagating through N layers, the "expansion coefficient" is then approximately (1+\varepsilon)^N. We know that:
\left(1 + \frac{1}{N}\right)^N < \lim_{N\to\infty} \left(1 + \frac{1}{N}\right)^N = e
So with \varepsilon=1/N the expansion coefficient stays bounded by e no matter how deep the model is, whereas with \varepsilon=1/\sqrt{N} it becomes (1+1/\sqrt{N})^N, which grows roughly like e^{\sqrt{N}}. That is to say, for a one-dimensional model, to ensure that backpropagation does not explode as the number of layers increases, \varepsilon must be no larger than \mathcal{O}(1/N). Thus, \varepsilon=1/\sqrt{N} is indeed not quite enough.
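The one-dimensional estimate is easy to evaluate directly. The helper name `expansion` below is hypothetical; it simply computes (1+\varepsilon)^N for a given depth.

```python
import math


def expansion(n_layers, eps):
    """One-dimensional magnitude estimate (1 + eps)^N after N residual layers."""
    return (1.0 + eps) ** n_layers
```

For N = 1000, `expansion(1000, 1 / 1000)` stays just below `math.e`, while `expansion(1000, 1 / math.sqrt(1000))` is larger by many orders of magnitude.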
However, for high-dimensional models, the situation improves. We multiply both sides of the backpropagation equation \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \boldsymbol{I} + \varepsilon\frac{\partial \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta})}{\partial \boldsymbol{x}} from the previous section by an arbitrary vector \boldsymbol{v}:
\boldsymbol{v}\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \boldsymbol{v} + \varepsilon\boldsymbol{v}\frac{\partial \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta})}{\partial \boldsymbol{x}}
Note that in the initial stage, \frac{\partial \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta})}{\partial \boldsymbol{x}} is also equivalent to a zero-mean random initialization matrix. In "Understanding Model Parameter Initialization Strategies from a Geometric Perspective", we discussed that such matrices are close to (a multiple of) an orthogonal matrix. Therefore, in the initial stage, \boldsymbol{v} and \varepsilon\boldsymbol{v}\frac{\partial \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta})}{\partial \boldsymbol{x}} are nearly orthogonal. Thus:
\left\Vert\boldsymbol{v}\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}\right\Vert^2 = \mathcal{O}\big((1 + \varepsilon^2)\Vert\boldsymbol{v}\Vert^2\big)
In short, in the high-dimensional case, the expansion coefficient of each layer is closer to 1+\varepsilon^2 rather than 1+\varepsilon. Based on the discussion of the one-dimensional case, we only need \varepsilon^2=\mathcal{O}(1/N), so \varepsilon=1/\sqrt{N} is basically sufficient.
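The high-dimensional estimate can be probed with a small simulation. The sketch below is an illustration under assumptions, not the procedure used in DeepNet: the Jacobian of \boldsymbol{f} at initialization is replaced by a fresh Gaussian matrix with zero mean and variance 1/dim at every layer, and the function name `squared_norm_growth` is hypothetical.

```python
import math
import random


def squared_norm_growth(dim, n_layers, eps, seed=0):
    """Push a row vector v through N maps v -> v (I + eps * J) and return
    ||v_final||^2 / ||v_initial||^2.

    Each J is a fresh Gaussian matrix with zero mean and variance 1/dim,
    a stand-in (assumption) for the Jacobian of f at initialization.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    start = sum(t * t for t in v)
    std = 1.0 / math.sqrt(dim)
    for _ in range(n_layers):
        J = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]
        vJ = [sum(v[i] * J[i][j] for i in range(dim)) for j in range(dim)]
        v = [a + eps * b for a, b in zip(v, vJ)]  # v (I + eps J)
    return sum(t * t for t in v) / start
```

With `dim=64`, `n_layers=64`, and `eps=1/8` (that is, \varepsilon=1/\sqrt{N}), the squared norm should grow only by a small constant factor, far below the one-dimensional estimate (1+\varepsilon)^N, which is on the order of a thousand for these values.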
Summary
This article discussed the question "Why do we need residuals?" Inspired by DeepNet, the conclusion reached is that residuals can simultaneously stabilize forward and backward propagation and solve incremental explosion, making deep models easier to train. In contrast, ordinary feedforward neural networks without residuals cannot solve these three problems simultaneously, making them difficult to train when they become deep.
When reposting, please include the original address of this article: https://kexue.fm/archives/8994