As we know, Gradient Clipping is a common technique used to make model training more stable. The most frequently used form of gradient clipping is based on the total norm of the gradients of all parameters. This operation can be expressed as: \begin{equation} \text{clip}(\boldsymbol{g},\tau)=\left\{\begin{aligned}&\boldsymbol{g}, & \|\boldsymbol{g}\|\leq \tau \\ &\frac{\tau}{\|\boldsymbol{g}\|}\boldsymbol{g},& \|\boldsymbol{g}\| > \tau \end{aligned}\right. \end{equation} In this way, \text{clip}(\boldsymbol{g},\tau) maintains the same direction as \boldsymbol{g}, but its norm does not exceed \tau. Note that \|\boldsymbol{g}\| here is the norm calculated by treating all the parameter gradients of the entire model as a single vector, which is the so-called Global Gradient Norm.
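To make the operation concrete, here is a minimal NumPy sketch of clipping by the global norm (the function name and the toy gradients are my own illustration; deep-learning frameworks ship their own utilities for this, such as PyTorch's clip_grad_norm_):

```python
import numpy as np

def clip_by_global_norm(grads, tau=1.0):
    """Scale a list of gradient arrays so that their global L2 norm is at most tau."""
    # Global norm: treat every parameter gradient of the model as one flat vector.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= tau:
        return grads                      # ||g|| <= tau: leave the gradients untouched
    scale = tau / global_norm             # ||g|| > tau: rescale so the norm equals tau
    return [g * scale for g in grads]

# Toy usage: two "parameter" gradients whose combined norm is 13, well above tau = 1.
grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]
clipped = clip_by_global_norm(grads, tau=1.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ~1.0
```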
I wonder if you have noticed a detail: whether for models with millions of parameters or hundreds of billions of parameters, the value of \tau is often set to 1. What does this mean? Is it simply a matter of reusing a default value, or is there a profound principle hidden behind it?
What is it?
Some readers might think that a default value is not necessarily the optimal value, so why worry about it? Indeed, \tau=1 may not be the optimal choice, but it is the default choice for many models and performs reasonably well under this default setting. This, in turn, suggests that \tau=1 possesses a universal rationality.
What does "rationality" mean here? Let’s go back to the \text{clip} operation. If \|\boldsymbol{g}\| is always smaller than \tau, then \text{clip} degenerates into the identity transformation; if \|\boldsymbol{g}\| is always larger than \tau, then \text{clip} degenerates into L2 normalization. In other words, what makes \text{clip} a genuine clipping operation is that \tau draws an appropriate dividing line: most values of \|\boldsymbol{g}\| fall below \tau, and only a small fraction exceed it. This is what the rationality of \tau refers to.
Of course, one can find counterexamples, and quite a few of them. Here, I mainly want to emphasize the prevalence of this phenomenon and the general applicability of this default setting, so meticulous readers need not be overly obsessed with individual details.
Therefore, we believe that the universal rationality of \tau=1 implies that regardless of the number of model parameters, the initialization method, or the choice of loss function, the total gradient norm happens to take 1 as the dividing line for "outliers." This is undoubtedly an astonishing property; that was exactly my feeling when I first realized this conclusion.
Why?
Why is there such a "coincidence"? My answer might be somewhat surprising: because only in this way can the model have the possibility of stable training.
Let’s consider the loss function \mathcal{L}(\boldsymbol{\theta}). If the optimizer’s update rule is \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \boldsymbol{u}_t, then the change in the loss function is approximately: \begin{equation} \Delta \mathcal{L} = \mathcal{L}(\boldsymbol{\theta}_{t+1}) - \mathcal{L}(\boldsymbol{\theta}_t) \approx (\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t)\cdot\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t) = -\eta\, \boldsymbol{u}_t\cdot \boldsymbol{g}_t \end{equation} where \boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t) denotes the gradient at step t. First, consider the simplest SGD, where \boldsymbol{u}_t = \boldsymbol{g}_t and hence \Delta \mathcal{L}=-\eta\|\boldsymbol{g}_t\|^2; that is, the change in the loss is proportional to the square of the gradient norm. We know that in both CV and NLP, pure SGD (without momentum) is a rather inefficient optimizer: in the middle and late stages of training, the average loss reduction per step on most tasks is far smaller than the learning rate itself, i.e., |\Delta \mathcal{L}| < \eta, which implies \|\boldsymbol{g}_t\| < 1. This indicates that \|\boldsymbol{g}_t\| < 1 is a long-term characteristic of a model that trains and converges normally.
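As a quick sanity check of the relation \Delta \mathcal{L}\approx-\eta\|\boldsymbol{g}_t\|^2 for SGD, here is a small sketch on a toy quadratic loss (the loss and the numbers are purely illustrative):

```python
import numpy as np

# Toy loss L(theta) = 0.5 * ||theta||^2, whose gradient is simply g = theta.
def loss(theta):
    return 0.5 * np.sum(theta ** 2)

def grad(theta):
    return theta

eta = 0.01
theta = np.array([0.3, -0.4, 0.2])

g = grad(theta)
theta_new = theta - eta * g                    # plain SGD step: u_t = g_t

actual_delta = loss(theta_new) - loss(theta)   # true Delta L
predicted_delta = -eta * np.sum(g ** 2)        # first-order estimate -eta * ||g_t||^2

print(actual_delta, predicted_delta)  # nearly equal for small eta; |Delta L| < eta iff ||g_t|| < 1
```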
Of course, in the early stages of training it is possible for the model to have \|\boldsymbol{g}_t\| > 1. This is normal, but it is rare to see \|\boldsymbol{g}_t\| \gg 1; or rather, a good initialization should avoid \|\boldsymbol{g}_t\| \gg 1, and methods such as DeepNorm build on a similar theoretical consideration. The reason is simple: if the gradient norm is too large, early learning becomes too "aggressive," and the model tends to converge prematurely to a poor local solution. Another remedy is to reduce \eta, which also reduces |\Delta \mathcal{L}|; this is why we usually use Warmup in the early stages of training.
By the way, for an understanding of Warmup, you can refer to the paper "Optimal Linear Decay Learning Rate Schedules and Further Refinements", which I believe provides the most reasonable analysis of Warmup.
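For concreteness, here is a minimal sketch of the kind of linear Warmup schedule meant above (the function name and constants are illustrative and not taken from the cited paper):

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linearly ramp the learning rate from ~0 up to base_lr over the first warmup_steps."""
    # Keeping eta small early on keeps |Delta L| = eta * ||g_t||^2 moderate
    # even if the gradient norm temporarily exceeds 1 right after initialization.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```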
What to do?
Simply put, because the change in the loss is proportional to the square of the gradient norm (in the SGD case), training stability dictates that the gradient norm cannot be too large and should stay below 1 over the long run. If a gradient norm significantly larger than 1 appears in the early stages, the usual strategy is Warmup. Alternatively, one could consider a more general strategy: set another threshold \mathcal{T} and clip \eta based on the value of \boldsymbol{u}_t\cdot \boldsymbol{g}_t: \begin{equation} \eta_t = \left\{\begin{aligned}&\eta,& \boldsymbol{u}_t\cdot \boldsymbol{g}_t\leq \mathcal{T} \\ &\frac{\mathcal{T}}{\boldsymbol{u}_t\cdot \boldsymbol{g}_t}\eta,& \boldsymbol{u}_t\cdot \boldsymbol{g}_t > \mathcal{T} \end{aligned}\right. \end{equation} This eliminates the need for a separate Warmup schedule and is more adaptive.
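Here is a minimal sketch of this adaptive rule (the function name and the choice of \mathcal{T} are my own): whenever \boldsymbol{u}_t\cdot \boldsymbol{g}_t exceeds the threshold, the effective learning rate is scaled down so that the predicted per-step loss change never exceeds \eta\,\mathcal{T}.

```python
import numpy as np

def clipped_lr(eta, u, g, T=1.0):
    """Clip the learning rate based on the inner product u . g.

    u: the optimizer's update direction u_t (g_t for SGD, roughly sign(g_t) for Adam);
    g: the gradient g_t, flattened into a single vector.
    """
    ug = float(np.dot(u, g))
    if ug <= T:
        return eta
    return eta * T / ug   # shrink eta so that eta_t * (u . g) = eta * T

# Usage with SGD (u = g): equivalent to capping ||g_t||^2 at T.
g = np.array([1.5, -2.0, 0.5])                 # ||g||^2 = 6.5 > T
eta_t = clipped_lr(eta=0.01, u=g, g=g, T=1.0)  # eta_t ~= 0.01 / 6.5
```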
For optimizers like Adam, we can perform an approximate analysis using \boldsymbol{u}_t=\text{sign}(\boldsymbol{g}_t), similar to the post "How Should the Learning Rate Change as the Batch Size Increases?". In this case: \begin{equation} \Delta \mathcal{L} = -\eta\, \text{sign}(\boldsymbol{g}_t)\cdot \boldsymbol{g}_t = -\eta\, \|\boldsymbol{g}_t\|_1 \end{equation} Here \|\cdot\|_1 is the L1 norm, i.e., the sum of the absolute values of the components. Since the gradient components are generally much smaller than 1, we have \|\boldsymbol{g}_t\|_1 \gg \|\boldsymbol{g}_t\|^2, i.e., at the same \eta the loss changes far more per step than under SGD. Therefore, again due to the requirement of stable training, the learning rate for Adam is usually significantly smaller than that for SGD. Furthermore, the above equation can be rewritten as: \begin{equation} \Delta \mathcal{L} = -\eta\, \text{sign}(\boldsymbol{g}_t)\cdot \boldsymbol{g}_t = -\eta\, \sqrt{N}\,\|\boldsymbol{g}_t\| \cos\big(\text{sign}(\boldsymbol{g}_t), \boldsymbol{g}_t\big) \end{equation} Here we assume that \boldsymbol{g}_t has no zero components, so \|\text{sign}(\boldsymbol{g}_t)\|=\sqrt{N}, where N is the total number of model parameters. In practice, \|\boldsymbol{g}_t\| and \cos(\text{sign}(\boldsymbol{g}_t), \boldsymbol{g}_t) turn out to be roughly constant across model scales. Therefore, to keep \Delta \mathcal{L} constant, \eta should be inversely proportional to \sqrt{N}: if the number of model parameters grows by a factor of 4, halving the learning rate is a reasonable starting point.
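The second identity is easy to verify numerically; the following toy sketch (values are illustrative) checks that -\eta\|\boldsymbol{g}_t\|_1 and -\eta\sqrt{N}\|\boldsymbol{g}_t\|\cos(\text{sign}(\boldsymbol{g}_t), \boldsymbol{g}_t) coincide, and its last comment records the resulting \eta\propto 1/\sqrt{N} rule of thumb.

```python
import numpy as np

g = np.array([0.02, -0.01, 0.03, -0.005])   # a toy gradient with no zero components
eta = 1e-3
N = g.size

s = np.sign(g)
delta_L_l1  = -eta * np.sum(np.abs(g))                         # -eta * ||g||_1
cos_sg      = np.dot(s, g) / (np.linalg.norm(s) * np.linalg.norm(g))
delta_L_cos = -eta * np.sqrt(N) * np.linalg.norm(g) * cos_sg   # -eta * sqrt(N) * ||g|| * cos(...)

print(delta_L_l1, delta_L_cos)  # identical up to floating-point error
# If ||g|| and cos(...) stay roughly constant as N grows, keeping Delta L fixed
# requires eta to scale like 1/sqrt(N).
```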
Conclusion
This article has presented some personal views and reflections on the phenomenon that "the default norm for gradient clipping is 1."