With the rapid advancement of computing power, more and more scenarios aim to achieve “trading compute for time”—that is, shortening model training time by stacking computing resources. Ideally, we hope that if we invest n times the computing power, the time to achieve the same result will be shortened to 1/n, keeping the total computing cost consistent. This “hope” seems reasonable and natural, but in reality, it is far from trivial. Even if we ignore bottlenecks like communication, when computing power exceeds a certain scale or the model is smaller than a certain scale, increasing compute often only allows for increasing the Batch Size. However, does increasing the Batch Size necessarily shorten training time while maintaining the same performance?
This is the topic we will discuss next: when the Batch Size increases, how should various hyperparameters, especially the learning rate, be adjusted to maintain the original training effect and maximize training efficiency? We can also call this the Scaling Law between Batch Size and learning rate.
The Variance Perspective
Intuitively, when the Batch Size increases, the gradient of each batch will be more accurate, so we can take larger steps—that is, increase the learning rate—to reach the destination faster and shorten training time. Most people can grasp this point. The question is: how much of an increase is most appropriate?
Square Root Scaling
The earliest answer to this question might be square root scaling, i.e., if the Batch Size is expanded by n times, the learning rate is expanded by \sqrt{n} times. This comes from the 2014 paper “One weird trick for parallelizing convolutional neural networks”. The derivation principle is to keep the variance of the SGD increment constant. Specifically, let the gradient of a randomly sampled instance be \tilde{\boldsymbol{g}}, with mean \boldsymbol{g} and covariance \boldsymbol{\Sigma}, where \boldsymbol{g} is the gradient of the entire population. When we increase the number of samples to B, we have: \begin{equation} \tilde{\boldsymbol{g}}_B \triangleq \frac{1}{B}\sum_{i=1}^B \tilde{\boldsymbol{g}}^{(i)},\quad \mathbb{E}[\tilde{\boldsymbol{g}}_B] = \boldsymbol{g},\quad \mathbb{E}[(\tilde{\boldsymbol{g}}_B-\boldsymbol{g})(\tilde{\boldsymbol{g}}_B-\boldsymbol{g})^{\top}]=\frac{\boldsymbol{\Sigma}}{B} \end{equation} That is, increasing the number of samples does not change the mean, while the covariance is reduced to 1/B. For the SGD optimizer, the increment is -\eta \tilde{\boldsymbol{g}}_B, and its covariance is proportional to \eta^2/B. Assuming that a moderate amount of noise (neither too much nor too little) is necessary during the optimization process, when the Batch Size B changes, we adjust the learning rate \eta to keep the noise intensity of the increment (the covariance matrix) constant, leading to: \begin{equation} \frac{\eta^2}{B} = \text{constant} \quad\Rightarrow\quad \eta\propto \sqrt{B} \end{equation} This yields the square root scaling law for the learning rate and Batch Size. The later paper “Train longer, generalize better: closing the generalization gap in large batch training of neural networks” also agrees with this choice.
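As a quick sanity check of this rule, the following NumPy sketch (with toy values for the gradient mean and covariance) draws batch-mean gradients at several Batch Sizes and verifies that scaling \eta\propto\sqrt{B} keeps the total variance of the SGD increment -\eta\tilde{\boldsymbol{g}}_B roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_trials = 4, 20_000
g_true = rng.normal(size=dim)                # population gradient (toy)
L = 0.3 * rng.normal(size=(dim, dim))
Sigma = L @ L.T                              # per-sample gradient covariance (toy)

def batch_mean_grads(B):
    """Draw n_trials batch-mean gradients for Batch Size B."""
    noise = rng.multivariate_normal(np.zeros(dim), Sigma, size=(n_trials, B))
    return g_true + noise.mean(axis=1)       # shape (n_trials, dim)

eta0, B0 = 0.1, 8                            # reference learning rate and Batch Size
for B in [8, 32, 128]:
    eta = eta0 * np.sqrt(B / B0)             # square root scaling
    g_B = batch_mean_grads(B)
    inc_var = (eta ** 2) * g_B.var(axis=0).sum()   # total variance of -eta * g_B
    print(f"B={B:4d}  eta={eta:.3f}  increment variance ≈ {inc_var:.5f}")
```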
Linear Scaling
Interestingly, linear scaling (i.e., \eta\propto B) often performs better in practice. Even the author of “One weird trick for parallelizing convolutional neural networks”, who first proposed square root scaling, pointed this out in his paper and stated that he could not provide a reasonable explanation for it.
To some extent, linear scaling is more consistent with our intuitive understanding. Especially as in “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, if we assume that the gradient direction of n consecutive batches does not change much, then linear scaling is almost obviously true. However, this assumption is clearly too strong. Relaxing this assumption requires connecting SGD with SDEs (Stochastic Differential Equations), which was accomplished by “Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations”. However, the first paper to use this to point out the scaling relationship between learning rate and Batch Size is likely “On the Generalization Benefit of Noise in Stochastic Gradient Descent”.
In hindsight, establishing this connection is not difficult to understand. Let the model parameters be \boldsymbol{w}; then the SGD update rule can be rewritten as: \begin{equation} \boldsymbol{w}_{t+1} =\boldsymbol{w}_t - \eta \tilde{\boldsymbol{g}}_{B,t} =\boldsymbol{w}_t - \eta \boldsymbol{g}_t - \eta (\tilde{\boldsymbol{g}}_{B,t} - \boldsymbol{g}_t) \end{equation} where \tilde{\boldsymbol{g}}_{B,t} - \boldsymbol{g}_t is the gradient noise. So far, we have made no assumptions about the distribution of this noise, only that its mean is \boldsymbol{0} and its covariance is \boldsymbol{\Sigma}_t/B. Next, we assume that the distribution of this noise is a normal distribution \mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}_t/B). Then the above iteration can be further rewritten as: \begin{equation} \begin{aligned} \boldsymbol{w}_{t+1} =&\, \boldsymbol{w}_t - \eta \boldsymbol{g}_t - \eta (\tilde{\boldsymbol{g}}_{B,t} - \boldsymbol{g}_t) \\[5pt] =&\, \boldsymbol{w}_t - \eta \boldsymbol{g}_t - \eta \sqrt{\frac{\boldsymbol{\Sigma}_t}{B}}\boldsymbol{z},\quad \boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I}) \\[5pt] =&\, \boldsymbol{w}_t - \eta \boldsymbol{g}_t - \sqrt{\eta} \sqrt{\frac{\eta\boldsymbol{\Sigma}_t}{B}}\boldsymbol{z},\quad \boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I}) \end{aligned} \end{equation} This means that the SGD iteration format \boldsymbol{w}_{t+1} =\boldsymbol{w}_t - \eta \tilde{\boldsymbol{g}}_{B,t} is actually approximately solving the SDE: \begin{equation} d\boldsymbol{w} = - \boldsymbol{g}_t dt - \sqrt{\frac{\eta\boldsymbol{\Sigma}_t}{B}}d\boldsymbol{z},\quad d\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},dt\boldsymbol{I}) \end{equation} Therefore, for the results to remain significantly unchanged when B varies, the form of the above SDE should remain constant, which leads to linear scaling \eta\propto B. The most critical step in this process is that the step size of the noise term in the SDE is the square root of the non-noise term, thereby separating a factor of \sqrt{\eta}. We also commented on this in “General Framework of Diffusion Models: SDE Edition”. Simply put, zero-mean Gaussian noise has a certain cancellation effect over the long term, so the step size must be increased to manifest the noise effect.
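The intuition behind linear scaling can also be checked on a toy problem. The sketch below (all constants are made up) runs noisy SGD on a one-dimensional quadratic loss and compares the stationary spread of the iterates under the baseline setting, under linear scaling (B,\eta)\to(4B,4\eta), and under square root scaling (B,\eta)\to(4B,2\eta); linear scaling preserves the diffusion term \eta\boldsymbol{\Sigma}/B and hence the noise floor:

```python
import numpy as np

rng = np.random.default_rng(1)
h, sigma2 = 1.0, 4.0              # curvature and per-sample gradient variance (toy)

def run_sgd(eta, B, steps=50_000):
    """Noisy SGD on L(w) = h*w^2/2 with gradient noise variance sigma2/B."""
    w, tail = 2.0, []
    for t in range(steps):
        g = h * w + rng.normal(0.0, np.sqrt(sigma2 / B))
        w -= eta * g
        if t > steps // 2:        # record the second half, near stationarity
            tail.append(w)
    return np.var(tail)

base = run_sgd(eta=0.05, B=8)
linear = run_sgd(eta=0.05 * 4, B=8 * 4)     # (B, eta) -> (4B, 4*eta)
sqrt_rule = run_sgd(eta=0.05 * 2, B=8 * 4)  # (B, eta) -> (4B, 2*eta)
print(f"baseline        stationary Var[w] ≈ {base:.5f}")
print(f"linear scaling  stationary Var[w] ≈ {linear:.5f}  (close to baseline)")
print(f"sqrt scaling    stationary Var[w] ≈ {sqrt_rule:.5f}  (noise floor roughly halved)")
```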
The above conclusions are based on the SGD optimizer. The paper “On the SDEs and Scaling Rules for Adaptive Gradient Algorithms” generalized this to optimizers like RMSProp and Adam, resulting in square root scaling. Coincidentally, the slightly earlier “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes” also applied square root scaling when testing Adam and its variant LAMB. For more content, one can also refer to the blog “How to Scale Hyperparameters as Batch Size Increases”.
Facing the Loss Directly
What is certain is that both square root scaling and linear scaling can hold only approximately, within a local range, because each implies that “as long as the Batch Size is large enough, the learning rate can be arbitrarily large,” which is clearly impossible. Furthermore, the work in the previous two sections centers on variance, but our fundamental task is to reduce the loss function, so a loss-function-oriented analysis may be more essential.
Monotonic and Bounded
A classic work from this perspective is OpenAI’s “An Empirical Model of Large-Batch Training”. It analyzes the optimal learning rate of SGD through a second-order approximation of the loss function, concluding that “the learning rate increases monotonically with the Batch Size but has an upper bound.” The same idea appeared in the slightly earlier “Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients”, although that paper was not used to analyze the effect of Batch Size.
The most critical idea in the entire derivation process is to treat the learning rate as an optimization parameter: let the loss function be \mathcal{L}(\boldsymbol{w}) and the gradient of the current batch be \tilde{\boldsymbol{g}}_B. Then the loss function after SGD is \mathcal{L}(\boldsymbol{w} - \eta\tilde{\boldsymbol{g}}_B). We treat the solving of the optimal learning rate as an optimization problem: \begin{equation} \eta^* = \mathop{\text{argmin}}_{\eta} \mathbb{E}[\mathcal{L}(\boldsymbol{w} - \eta\tilde{\boldsymbol{g}}_B)] \end{equation} This objective is quite intuitive: choose the learning rate such that on average the training efficiency is highest (the loss function decreases fastest). To solve this problem, we expand the loss function to the second order: \begin{equation} \mathcal{L}(\boldsymbol{w} - \eta\tilde{\boldsymbol{g}}_B) \approx \mathcal{L}(\boldsymbol{w}) - \eta\tilde{\boldsymbol{g}}_B^{\top}\underbrace{\frac{\partial \mathcal{L}(\boldsymbol{w})}{\partial\boldsymbol{w}}}_{\text{which is }\boldsymbol{g}} + \frac{1}{2}\eta^2 \tilde{\boldsymbol{g}}_B^{\top}\underbrace{\frac{\partial^2 \mathcal{L}(\boldsymbol{w})}{\partial\boldsymbol{w}^2}}_{\text{denoted as }\boldsymbol{H}}\tilde{\boldsymbol{g}}_B = \mathcal{L}(\boldsymbol{w}) - \eta\tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{g} + \frac{1}{2}\eta^2 \tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{g}}_B \end{equation} Here \boldsymbol{H} is the Hessian matrix, and \frac{\partial \mathcal{L}(\boldsymbol{w})}{\partial\boldsymbol{w}} is the gradient of the loss function. The ideal objective function is based on all samples, which is why its gradient is the mean of \tilde{\boldsymbol{g}}_B, denoted as \boldsymbol{g}. Taking the expectation, we get: \begin{equation} \mathbb{E}[\mathcal{L}(\boldsymbol{w} - \eta\tilde{\boldsymbol{g}}_B)] \approx \mathbb{E}[\mathcal{L}(\boldsymbol{w}) - \eta\tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{g} + \frac{1}{2}\eta^2 \tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{g}}_B] = \mathcal{L}(\boldsymbol{w}) - \eta\boldsymbol{g}^{\top}\boldsymbol{g} + \frac{1}{2}\eta^2 \mathbb{E}[\tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{g}}_B] \end{equation} The last term involves a small trick: \begin{equation} \begin{aligned} \mathbb{E}[\tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{g}}_B] =&\, \mathbb{E}[\mathop{\mathrm{tr}}(\tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{g}}_B)]= \mathbb{E}[\mathop{\mathrm{tr}}(\tilde{\boldsymbol{g}}_B\tilde{\boldsymbol{g}}_B^{\top}\boldsymbol{H})] = \mathop{\mathrm{tr}}(\mathbb{E}[\tilde{\boldsymbol{g}}_B\tilde{\boldsymbol{g}}_B^{\top}]\boldsymbol{H})\\[5pt] =&\, \mathop{\mathrm{tr}}((\boldsymbol{g}\boldsymbol{g}^{\top} + \boldsymbol{\Sigma}/B)\boldsymbol{H}) = \boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g} + \mathop{\mathrm{tr}}(\boldsymbol{\Sigma}\boldsymbol{H})/B \end{aligned} \end{equation} The transformation mainly utilizes \mathop{\mathrm{tr}}(\boldsymbol{A}\boldsymbol{B}) = \mathop{\mathrm{tr}}(\boldsymbol{B}\boldsymbol{A}). 
Now, assuming the positive definiteness of \boldsymbol{H}, the problem becomes finding the minimum of a quadratic function, which is easily solved as: \begin{equation} \eta^* \approx \frac{\boldsymbol{g}^{\top}\boldsymbol{g}}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g} + \mathop{\mathrm{tr}}(\boldsymbol{\Sigma}\boldsymbol{H})/B} = \frac{\eta_{\max}}{1 + \mathcal{B}_{\text{noise}}/B}\label{eq:eta-opt} \end{equation} This yields the result that \eta^* increases monotonically with B and has an upper bound, where: \begin{equation} \eta_{\max} = \frac{\boldsymbol{g}^{\top}\boldsymbol{g}}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}},\qquad\mathcal{B}_{\text{noise}} = \frac{\mathop{\mathrm{tr}}(\boldsymbol{\Sigma}\boldsymbol{H})}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}} \end{equation}
Practical Analysis
When B \ll \mathcal{B}_{\text{noise}}, 1 + \mathcal{B}_{\text{noise}}/B\approx \mathcal{B}_{\text{noise}}/B, so \eta^* \approx \eta_{\max}B/\mathcal{B}_{\text{noise}}\propto B, which is linear scaling. This again shows that linear scaling is only a local approximation for small Batch Sizes. When B > \mathcal{B}_{\text{noise}}, \eta^* gradually approaches the saturation value \eta_{\max}, meaning that the increase in training cost far outweighs the improvement in training efficiency. Therefore, \mathcal{B}_{\text{noise}} acts as a watershed; when the Batch Size exceeds this value, there is no need to continue investing compute to increase the Batch Size.
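These regimes are easy to see numerically on a toy problem where \boldsymbol{g}, \boldsymbol{H} and \boldsymbol{\Sigma} are available explicitly. The following sketch (with randomly generated toy quantities) computes \eta_{\max} and \mathcal{B}_{\text{noise}} from the formulas above and prints \eta^* in the small-B, intermediate, and large-B regimes:

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 6
g = rng.normal(size=dim)                  # full-batch gradient (toy)
A = rng.normal(size=(dim, dim))
H = A @ A.T + dim * np.eye(dim)           # positive definite Hessian (toy)
C = 0.5 * rng.normal(size=(dim, dim))
Sigma = C @ C.T                           # per-sample gradient covariance (toy)

eta_max = (g @ g) / (g @ H @ g)
B_noise = np.trace(Sigma @ H) / (g @ H @ g)

def eta_opt(B):
    return eta_max / (1.0 + B_noise / B)

print(f"eta_max = {eta_max:.4f},  B_noise = {B_noise:.4f}")
for B in [B_noise / 100, B_noise, 100 * B_noise]:
    print(f"B = {B:10.4f}  ->  eta* = {eta_opt(B):.6f}")
# B << B_noise: eta* ~ eta_max * B / B_noise (linear scaling);
# B >> B_noise: eta* saturates at eta_max.
```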
For practice, the most critical question is how to estimate \eta_{\max} and \mathcal{B}_{\text{noise}}. In particular, \mathcal{B}_{\text{noise}} directly relates to the scaling law of the learning rate and the saturation of training efficiency. Direct calculation of both involves the Hessian matrix \boldsymbol{H}, whose computational cost is proportional to the square of the number of parameters. In today’s era where models with hundreds of millions of parameters are considered small, calculating the Hessian matrix is clearly unrealistic. Thus, more efficient calculation methods must be found.
Let’s look at \mathcal{B}_{\text{noise}} = \frac{\mathop{\mathrm{tr}}(\boldsymbol{\Sigma}\boldsymbol{H})}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}}. Both the numerator and denominator contain \boldsymbol{H}, which tempts us to “cancel” them out. Indeed, the simplification idea follows this: assuming \boldsymbol{H} is approximately a multiple of the identity matrix, we get: \begin{equation} \mathcal{B}_{\text{noise}} = \frac{\mathop{\mathrm{tr}}(\boldsymbol{\Sigma}\boldsymbol{H})}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}}\approx \frac{\mathop{\mathrm{tr}}(\boldsymbol{\Sigma})}{\boldsymbol{g}^{\top}\boldsymbol{g}}\triangleq \mathcal{B}_{\text{simple}} \end{equation} \mathcal{B}_{\text{simple}} is more computationally feasible, and experiments have found it to be a good approximation of \mathcal{B}_{\text{noise}}. Therefore, we choose to estimate \mathcal{B}_{\text{simple}} instead of \mathcal{B}_{\text{noise}}. Note that \mathop{\mathrm{tr}}(\boldsymbol{\Sigma}) only requires the diagonal elements, so we don’t need to compute the full covariance matrix; we only need to calculate the variance of each gradient component individually and sum them up. In data-parallel scenarios, the gradient variance can be estimated directly using the gradients calculated on each device.
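A schematic NumPy version of this estimator is given below. It assumes we are handed the per-device gradients (each averaged over b samples) that data parallelism already produces; the function name and the bias correction for \Vert\boldsymbol{g}\Vert^2 are our own additions, not from any particular framework:

```python
import numpy as np

def estimate_B_simple(per_device_grads, samples_per_device):
    """Estimate B_simple = tr(Sigma) / |g|^2 from data-parallel gradients.

    per_device_grads: array of shape (K, dim), the gradient from each of K devices,
                      each averaged over `samples_per_device` samples.
    """
    G = np.asarray(per_device_grads)
    K, b = G.shape[0], samples_per_device
    g_hat = G.mean(axis=0)                       # estimate of the true gradient g
    # per-device gradients have covariance Sigma/b, so the sample variance across
    # devices (times b) estimates the diagonal of Sigma
    tr_Sigma = b * G.var(axis=0, ddof=1).sum()
    # |g_hat|^2 overestimates |g|^2 by tr(Sigma)/(K*b); subtract that bias
    g_sq = g_hat @ g_hat - tr_Sigma / (K * b)
    return tr_Sigma / g_sq

# toy usage: simulate K devices, each averaging b per-sample gradients
rng = np.random.default_rng(3)
dim, K, b = 10, 32, 64
g_true = rng.normal(size=dim)
sigma = 3.0                                      # per-sample gradient std (toy)
G = g_true + rng.normal(scale=sigma / np.sqrt(b), size=(K, dim))
print("true  B_simple ≈", dim * sigma**2 / (g_true @ g_true))
print("est.  B_simple ≈", estimate_B_simple(G, b))
```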
It should be noted that results like Eq. [eq:eta-opt] are actually dynamic, meaning that theoretically \eta_{\max}, \mathcal{B}_{\text{noise}}, and \mathcal{B}_{\text{simple}} are different at each training step. Therefore, if we want a static law, we need to train for a period until the model training is “on track” before the calculated \mathcal{B}_{\text{simple}} is reliable. Alternatively, we can continuously monitor \mathcal{B}_{\text{simple}} during training to judge the gap between the current setting and the optimum.
As for \eta_{\max}, there is no need to estimate it using the formula. One can simply perform a grid search for the learning rate at a small Batch Size to find an approximate \eta^*, and then combine it with the estimated \mathcal{B}_{\text{simple}} to back-calculate \eta_{\max}.
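For example (all numbers below are hypothetical), if a grid search at a small Batch Size B_0 yields an approximately optimal learning rate, then inverting Eq. [eq:eta-opt] gives \eta_{\max} and a prediction for other Batch Sizes:

```python
B0, eta_star_B0 = 64, 3e-4       # hypothetical grid-search result at a small Batch Size
B_simple = 2048                  # hypothetical estimate standing in for B_noise

eta_max = eta_star_B0 * (1.0 + B_simple / B0)       # invert eta* = eta_max / (1 + B_noise/B)
eta_at = lambda B: eta_max / (1.0 + B_simple / B)   # predicted optimum at other Batch Sizes
print(f"eta_max ≈ {eta_max:.2e},  predicted eta*(B=4096) ≈ {eta_at(4096):.2e}")
```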
Data Efficiency
From the above results, we can also derive an asymptotic relationship between the amount of training data and the number of training steps. The derivation is simple: substituting [eq:eta-opt] into the loss function, we can calculate that under the optimal learning rate, the reduction in the loss function per iteration is: \begin{equation} \overline{\Delta\mathcal{L}} = \mathcal{L}(\boldsymbol{w}) - \mathbb{E}[\mathcal{L}(\boldsymbol{w} - \eta^*\tilde{\boldsymbol{g}}_B)] \approx \frac{\Delta\mathcal{L}_{\max}}{1 + \mathcal{B}_{\text{noise}}/B}\label{eq:Delta-L-sgd} \end{equation} where \Delta\mathcal{L}_{\max} = \frac{(\boldsymbol{g}^{\top}\boldsymbol{g})^2}{2\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}}. The key is the interpretation of this result.
When B\to\infty (full-batch SGD), the loss reduction per step reaches the maximum \Delta\mathcal{L}_{\max}. At this point, the target can be reached with the minimum number of training steps (denoted as S_{\min}). When B is finite, the average loss reduction per step is only \overline{\Delta\mathcal{L}}, which means we need 1 + \mathcal{B}_{\text{noise}}/B steps to achieve the reduction of a single full-batch SGD step. Thus, the total number of training steps is roughly S = (1 + \mathcal{B}_{\text{noise}}/B)S_{\min}.
Since the Batch Size is B, the total number of samples consumed during training is E = BS = (B + \mathcal{B}_{\text{noise}})S_{\min}. This is an increasing function of B, and as B\to 0, E_{\min} = \mathcal{B}_{\text{noise}}S_{\min}. This indicates that as long as we use a small enough Batch Size to train the model, the total number of training samples E required will decrease accordingly, at the cost of a very large number of training steps S. Furthermore, using these notations, we can write their relationship as: \begin{equation} \left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1\label{eq:E-S} \end{equation} This is the scaling law between training data volume and training steps, indicating that if the data volume is small, the Batch Size should be reduced to allow for more training steps, thereby increasing the chance of reaching a better solution. The derivation here is simplified by the author, assuming the invariance of \mathcal{B}_{\text{noise}} and \Delta\mathcal{L}_{\max} throughout the training process. If necessary, one can follow the original paper’s appendix to handle dynamic changes more precisely using integration (though this requires introducing the assumption B = \sqrt{r\mathcal{B}_{\text{noise}}}).
Additionally, since \mathcal{B}_{\text{noise}} = E_{\min}/S_{\min}, the above equation also provides another scheme for estimating \mathcal{B}_{\text{noise}}: by obtaining multiple (S,E) pairs through multiple experiments and grid searches, and then fitting the above equation to estimate E_{\min} and S_{\min}, and subsequently calculating \mathcal{B}_{\text{noise}}.
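A least-squares sketch of this fitting scheme is shown below, using synthetic (S,E) pairs generated from a hypothetical ground truth so that the recovered values can be compared against it:

```python
import numpy as np
from scipy.optimize import curve_fit

def E_of_S(S, S_min, E_min):
    """(S/S_min - 1)(E/E_min - 1) = 1, solved for E."""
    return E_min * S / (S - S_min)

# synthetic (S, E) pairs: steps-to-target and examples-to-target at several Batch Sizes
S_min_true, E_min_true = 5_000, 2.0e7                # hypothetical ground truth
B_list = np.array([256, 1024, 4096, 16384, 65536], dtype=float)
B_noise_true = E_min_true / S_min_true
S_data = S_min_true * (1 + B_noise_true / B_list)
E_data = B_list * S_data * (1 + 0.03 * np.random.default_rng(4).normal(size=B_list.size))

(S_min_fit, E_min_fit), _ = curve_fit(E_of_S, S_data, E_data,
                                      p0=(0.5 * S_data.min(), 0.5 * E_data.min()))
print(f"fitted S_min ≈ {S_min_fit:.0f}, E_min ≈ {E_min_fit:.3g}")
print(f"implied B_noise = E_min/S_min ≈ {E_min_fit / S_min_fit:.0f} "
      f"(true value {B_noise_true:.0f})")
```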
Adaptive Version
It must be said that OpenAI is truly one of the pioneers of the various Scaling Laws. The preceding analysis is quite brilliant and its results are rich; more importantly, the entire derivation is not complex, which gives it a sense of elegant simplicity. However, these conclusions are based on SGD, and their applicability to adaptive learning rate optimizers like Adam remained unclear until the work in “Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling”.
Sign Approximation
The idea for analyzing Adam is the same as for SGD, based on second-order expansion. The difference is that the direction vector is changed from \tilde{\boldsymbol{g}}_B to a general vector \tilde{\boldsymbol{\varphi}}_B. We have: \begin{equation} \mathbb{E}[\mathcal{L}(\boldsymbol{w} - \eta\tilde{\boldsymbol{\varphi}}_B)] \approx \mathcal{L}(\boldsymbol{w}) - \eta\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}\boldsymbol{g} + \frac{1}{2}\eta^2 \mathop{\mathrm{tr}}(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\boldsymbol{H}) \end{equation} Now we need to determine \tilde{\boldsymbol{\varphi}}_B and calculate the corresponding \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] and \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]. Since we only need an asymptotic relationship, similar to “Can LoRA Gain a Bit More with Different Learning Rates?”, we choose SignSGD, i.e., \tilde{\boldsymbol{\varphi}}_B = \mathop{\mathrm{sign}}(\tilde{\boldsymbol{g}}_B), as an approximation for Adam. This approach likely originated in “Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients”. The rationality of this approximation lies in two points:
1. Regardless of the values of \beta_1, \beta_2, the update vector of the first step of Adam is \mathop{\mathrm{sign}}(\tilde{\boldsymbol{g}}_B);
2. When \beta_1=\beta_2=0, the update vector of Adam is always \mathop{\mathrm{sign}}(\tilde{\boldsymbol{g}}_B).
To calculate \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] and \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}], we assume, as in the “Linear Scaling” section, that \tilde{\boldsymbol{g}}_B follows the distribution \mathcal{N}(\boldsymbol{g},\boldsymbol{\Sigma}/B). To simplify calculations, we further assume that \boldsymbol{\Sigma} is a diagonal matrix \mathop{\mathrm{diag}}(\sigma_1^2,\sigma_2^2,\sigma_3^2,\cdots), i.e., that the components are independent, so each component can be handled separately. Using the reparameterization trick, \tilde{g}_B\sim \mathcal{N}(g, \sigma^2/B) is equivalent to \tilde{g}_B=g + \sigma z/\sqrt{B}, z\sim\mathcal{N}(0,1). Therefore: \begin{equation} \begin{aligned} \mathbb{E}[\tilde{\varphi}_B] =&\, \mathbb{E}[\mathop{\mathrm{sign}}(g + \sigma z/\sqrt{B})] = \mathbb{E}[\mathop{\mathrm{sign}}(g\sqrt{B}/\sigma + z)] \\[5pt] =&\, \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \mathop{\mathrm{sign}}(g\sqrt{B}/\sigma + z) e^{-z^2/2}dz \\[5pt] =&\, \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{-g\sqrt{B}/\sigma} (-1)\times e^{-z^2/2}dz + \frac{1}{\sqrt{2\pi}}\int_{-g\sqrt{B}/\sigma}^{\infty} 1\times e^{-z^2/2}dz \\[5pt] =&\, \mathop{\mathrm{erf}}\left(\frac{g}{\sigma}\sqrt{\frac{B}{2}}\right) \end{aligned} \end{equation} Here \mathop{\mathrm{erf}} is the error function, an S-shaped function with a range of (-1,1) similar to \tanh, which can serve as a smooth approximation of \mathop{\mathrm{sign}}. Since \mathop{\mathrm{erf}} itself has no elementary function expression, we look for an elementary approximation to observe the laws more intuitively. We previously discussed this in “How the Two Elementary Function Approximations of GELU Came About”, but those were too complex. Here we use a simpler one: \begin{equation} \mathop{\mathrm{erf}}(x)\approx \mathop{\mathrm{sign}}(x) = \frac{x}{|x|} = \frac{x}{\sqrt{x^2}}\approx \frac{x}{\sqrt{x^2+c}} \end{equation} We choose c=\pi/4 so that the first-order approximation at x=0 equals that of \mathop{\mathrm{erf}}. Given the many approximations already made, the exact value of c is not very important; we just need to know such a c > 0 exists. Based on this: \begin{equation} \mathbb{E}[\tilde{\varphi}_B] \approx \frac{g/\sigma}{\sqrt{\pi/2B+(g/\sigma)^2}}\quad\Rightarrow\quad\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]_i \approx \frac{g_i/\sigma_i}{\sqrt{\pi/2B+(g_i/\sigma_i)^2}}\triangleq \mu_i \end{equation} One obvious difference between Adam and SGD is that \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] already depends on B. Fortunately, the second moment is simpler because the square of \mathop{\mathrm{sign}}(x) is always 1: \begin{equation} \mathbb{E}[\tilde{\varphi}_B^2] = 1\quad\Rightarrow\quad\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j} \to\left\{\begin{aligned}&=1, & i = j \\ &\approx\mu_i \mu_j,&\,i\neq j\end{aligned}\right. \end{equation}
Using these results, we obtain: \begin{gather} \eta^* \approx \frac{\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}\boldsymbol{g}}{\mathop{\mathrm{tr}}(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\boldsymbol{H})} \approx \frac{\sum_i \mu_i g_i}{\sum_i H_{i,i} + \sum_{i\neq j} \mu_i \mu_j H_{i,j}}\label{eq:eta-opt-sign} \\[5pt] \overline{\Delta\mathcal{L}} = \mathcal{L}(\boldsymbol{w}) - \mathbb{E}[\mathcal{L}(\boldsymbol{w} - \eta^*\tilde{\boldsymbol{\varphi}}_B)] \approx \frac{1}{2}\frac{(\sum_i \mu_i g_i)^2}{\sum_i H_{i,i} + \sum_{i\neq j} \mu_i \mu_j H_{i,j}}\label{eq:Delta-L-sign} \end{gather}
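Before moving on, the closed form \mathbb{E}[\mathop{\mathrm{sign}}(\tilde{g}_B)]=\mathop{\mathrm{erf}}\big((g/\sigma)\sqrt{B/2}\big) and its elementary approximation x/\sqrt{x^2+\pi/4} are easy to verify by Monte Carlo for a single component (toy values of g and \sigma below):

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(5)
g, sigma = 0.3, 1.5          # toy mean and per-sample std of one gradient component

for B in [1, 4, 16, 64, 256]:
    z = rng.normal(size=1_000_000)
    mc = np.mean(np.sign(g + sigma * z / sqrt(B)))     # Monte Carlo E[sign(g_tilde_B)]
    x = (g / sigma) * sqrt(B / 2)
    exact = erf(x)                                     # closed form
    approx = x / sqrt(x * x + pi / 4)                  # elementary approximation
    print(f"B={B:4d}  MC={mc:+.4f}  erf={exact:+.4f}  x/sqrt(x^2+pi/4)={approx:+.4f}")
```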
Two Special Cases
Compared to SGD’s Eq. [eq:eta-opt], Adam’s Eq. [eq:eta-opt-sign] is more complex, making it hard to see the dependence on B directly. We look at special cases.
First, consider B\to\infty, where \mu_i = \mathop{\mathrm{sign}}(g_i), so: \begin{equation} \eta^* \approx \frac{\sum_i |g_i|}{\sum_i H_{i,i} + \sum_{i\neq j} \mathop{\mathrm{sign}}(g_i g_j) H_{i,j}} \end{equation} Unlike SGD’s \eta_{\max}, which is invariant to the overall scale of the gradient, this expression is proportional to the gradient’s scale.
Next, consider the case where \boldsymbol{H} is diagonal (H_{i,j}=0 for i\neq j): \begin{equation} \eta^* \approx \frac{\sum_i \mu_i g_i}{\sum_i H_{i,i}}=\frac{1}{\sum_i H_{i,i}}\sum_i \frac{g_i^2/\sigma_i}{\sqrt{\pi/2B+(g_i/\sigma_i)^2}} \end{equation} Each term in the sum is monotonically increasing and bounded with respect to B, so the total result is as well. To capture the essence, we simplify \mu_i further (diverging from the original paper): \begin{equation} \mu_i = \frac{g_i/\sigma_i}{\sqrt{\pi/2B+(g_i/\sigma_i)^2}} = \frac{\mathop{\mathrm{sign}}(g_i)}{\sqrt{1 + \pi(\sigma_i/g_i)^2/2B}} \approx \frac{\mathop{\mathrm{sign}}(g_i)}{\sqrt{1 + \pi\kappa^2/2B}}\label{eq:mu-approx} \end{equation} Assuming there exists a constant \kappa^2 independent of i (e.g., a mean of (\sigma_i/g_i)^2), then: \begin{equation} \eta^* \approx \frac{\sum_i \mu_i g_i}{\sum_i H_{i,i}}\approx \frac{\sum_i |g_i|}{\sum_i H_{i,i}}\frac{1}{\sqrt{1 + \pi\kappa^2/2B}}\label{eq:eta-opt-sign-diag} \end{equation} When \pi\kappa^2\gg 2B, we have: \begin{equation} \eta^* \approx \frac{\sum_i |g_i|}{\kappa\sum_i H_{i,i}}\sqrt{\frac{2B}{\pi}} \propto \sqrt{B} \end{equation} This shows that for small Batch Sizes, Adam indeed follows the square root scaling law.
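A quick numerical check of Eq. [eq:eta-opt-sign-diag] with made-up constants confirms the two regimes: the ratio \eta^*(4B)/\eta^*(B) is close to 2 (square root scaling) at small B and tends to 1 (saturation) at large B:

```python
import numpy as np

kappa2 = 1.0e4                       # toy value of kappa^2
prefactor = 0.2                      # toy value of sum_i |g_i| / sum_i H_ii

def eta_diag(B):
    """Eq. [eq:eta-opt-sign-diag]: optimal learning rate under a diagonal Hessian."""
    return prefactor / np.sqrt(1.0 + np.pi * kappa2 / (2.0 * B))

for B in [16, 64, 256, 1024, 1_000_000]:
    ratio = eta_diag(4 * B) / eta_diag(B)
    print(f"B={B:8d}  eta*={eta_diag(B):.5f}  eta*(4B)/eta*(B)={ratio:.3f}")
# small B: ratio ~ 2 (square root scaling); large B: ratio ~ 1 (saturation)
```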
Surge Phenomenon
Applying approximation [eq:mu-approx] to the original Eq. [eq:eta-opt-sign] reveals new characteristics: \begin{equation} \eta^* \approx \frac{\sum_i \mu_i g_i}{\sum_i H_{i,i} + \sum_{i\neq j} \mu_i \mu_j H_{i,j}} \approx \frac{\eta_{\max}}{\frac{1}{2}\left(\frac{\beta_{\text{noise}}}{\beta} + \frac{\beta}{\beta_{\text{noise}}}\right)}\label{eq:eta-opt-beta} \end{equation} where \beta = (1 + \pi\kappa^2/2B)^{-1/2}, and: \begin{equation} \beta_{\text{noise}} = \sqrt{\frac{\sum_i H_{i,i}}{\sum_{i\neq j}\mathop{\mathrm{sign}}(g_i g_j) H_{i,j}}},\quad \eta_{\max} = \frac{\sum_i |g_i|}{2\sqrt{\left(\sum_i H_{i,i}\right)\left(\sum_{i\neq j} \mathop{\mathrm{sign}}(g_i g_j) H_{i,j}\right)}} \end{equation} Note that \beta is a monotonic function of B, but Eq. [eq:eta-opt-beta] is not monotonic with respect to \beta; it increases then decreases, peaking at \beta=\beta_{\text{noise}}. This implies a corresponding \mathcal{B}_{\text{noise}} such that when the Batch Size exceeds it, the optimal learning rate should decrease rather than increase! This is the “Surge phenomenon.” (This assumes \beta_{\text{noise}} < 1; if \beta_{\text{noise}} \geq 1, the relationship remains monotonic.)
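To see when the Surge phenomenon actually appears, we can evaluate Eq. [eq:eta-opt-beta] for two hypothetical Hessian configurations, one with \beta_{\text{noise}} < 1 (off-diagonal mass dominates, non-monotonic \eta^*) and one with \beta_{\text{noise}} \geq 1 (monotonic saturation). All constants below are made up for illustration:

```python
import numpy as np

def eta_opt(B, kappa2, sum_abs_g, A_diag, C_offdiag):
    """eta*(B) from the beta parameterization of Eq. [eq:eta-opt-beta].

    A_diag    = sum_i H_ii
    C_offdiag = sum_{i != j} sign(g_i g_j) H_ij   (assumed positive here)
    """
    beta = 1.0 / np.sqrt(1.0 + np.pi * kappa2 / (2.0 * B))
    beta_noise = np.sqrt(A_diag / C_offdiag)
    eta_max = sum_abs_g / (2.0 * np.sqrt(A_diag * C_offdiag))
    return eta_max / (0.5 * (beta_noise / beta + beta / beta_noise)), beta_noise

kappa2, sum_abs_g = 1.0e4, 50.0               # toy constants
Bs = np.array([64, 256, 1024, 4096, 16384, 65536], dtype=float)

for name, A, C in [("beta_noise < 1 (surge)", 100.0, 400.0),
                   ("beta_noise >= 1 (monotonic)", 400.0, 100.0)]:
    etas, bn = eta_opt(Bs, kappa2, sum_abs_g, A, C)
    print(f"{name}: beta_noise = {bn:.2f}")
    for B, e in zip(Bs, etas):
        print(f"  B={B:7.0f}  eta* = {e:.4f}")
```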
Regarding Adam’s \eta^*, OpenAI’s paper appendix “guessed” it should be: \begin{equation} \eta^* \approx \frac{\eta_{\max}}{(1 + \mathcal{B}_{\text{noise}}/B)^{\alpha}}\label{eq:openai-adam} \end{equation} where 0.5 < \alpha < 1. It now appears this form is an approximation when diagonal elements of the Hessian dominate. When off-diagonal elements are non-negligible, the Surge phenomenon can emerge.
How to understand the Surge phenomenon? It essentially reflects the sub-optimality of adaptive learning rate strategies. Taking \tilde{\boldsymbol{\varphi}}_B = \mathop{\mathrm{sign}}(\tilde{\boldsymbol{g}}_B) as an example, as B\to \infty, the direction becomes \mathop{\mathrm{sign}}(\boldsymbol{g}). Is \mathop{\mathrm{sign}}(\boldsymbol{g}) the most scientific update direction? Not necessarily, especially in late training. Thus, a certain amount of noise in \mathop{\mathrm{sign}}(\tilde{\boldsymbol{g}}_B) at appropriate B might correct this sub-optimality. As B increases and noise decreases, this correction opportunity is lost, requiring a more cautious (lower) learning rate.
Efficiency Relationship
As with SGD, we consider \overline{\Delta\mathcal{L}}. Substituting [eq:eta-opt-beta] into [eq:Delta-L-sign] and simplifying: \begin{equation} \overline{\Delta\mathcal{L}} \approx \frac{\Delta\mathcal{L}_{\max}}{1 + \mathcal{B}_{\text{noise-2}}/B}\label{eq:Delta-L-sign-2} \end{equation} where: \begin{equation} \Delta\mathcal{L}_{\max} = \frac{\beta_{\text{noise}}\eta_{\max}\sum_i|g_i|}{1 + \beta_{\text{noise}}^2},\quad \mathcal{B}_{\text{noise-2}} = \frac{\pi\kappa^2\beta_{\text{noise}}^2}{2(1 + \beta_{\text{noise}}^2)}\label{eq:beta-B-noise} \end{equation} Here \mathcal{B}_{\text{noise-2}} is a new notation, not the theoretical optimal Batch Size \mathcal{B}_{\text{noise}} solved from \beta=\beta_{\text{noise}}, which is: \begin{equation} \mathcal{B}_{\text{noise}} = \frac{\pi\kappa^2\beta_{\text{noise}}^2}{2(1 - \beta_{\text{noise}}^2)} \end{equation} Their relationship is: \begin{equation} \frac{1}{\mathcal{B}_{\text{noise-2}}} - \frac{1}{\mathcal{B}_{\text{noise}}} = \frac{4}{\pi\kappa^2}\quad\Rightarrow\quad \mathcal{B}_{\text{noise}} = \left(\frac{1}{\mathcal{B}_{\text{noise-2}}} - \frac{4}{\pi\kappa^2}\right)^{-1}\label{eq:B-1-2} \end{equation} Since Eq. [eq:Delta-L-sign-2] has the same form as SGD’s Eq. [eq:Delta-L-sgd], the same analysis applies, leading to Eq. [eq:E-S]: \begin{equation} \left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1 \end{equation} where E_{\min}/S_{\min} = \mathcal{B}_{\text{noise-2}}. This provides a way to estimate \beta_{\text{noise}} and \mathcal{B}_{\text{noise}} by fitting (S,E) pairs and estimating \kappa^2.
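Given an estimate of \mathcal{B}_{\text{noise-2}} (e.g., E_{\min}/S_{\min} from the (S,E) fit) and an estimate of \kappa^2, Eqs. [eq:beta-B-noise] and [eq:B-1-2] convert it into \beta_{\text{noise}} and \mathcal{B}_{\text{noise}}. A small sketch with hypothetical numbers:

```python
import numpy as np

B_noise_2 = 1500.0        # hypothetical E_min / S_min from the (S, E) fit
kappa2 = 1.0e4            # hypothetical estimate of kappa^2

# Eq. [eq:B-1-2]: 1/B_noise_2 - 1/B_noise = 4/(pi*kappa^2)
B_noise = 1.0 / (1.0 / B_noise_2 - 4.0 / (np.pi * kappa2))
# Eq. [eq:beta-B-noise]: B_noise_2 = pi*kappa2*beta_noise^2 / (2*(1 + beta_noise^2))
beta_noise_sq = 2.0 * B_noise_2 / (np.pi * kappa2 - 2.0 * B_noise_2)
print(f"beta_noise ≈ {np.sqrt(beta_noise_sq):.3f},  B_noise ≈ {B_noise:.0f}")
```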
Supplementary Notes
The starting points and conclusions here are similar to those of the original paper “Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling”, but the approximation process differs. Most of the original paper’s conclusions are derived under the assumption B \ll \pi(\sigma_i/g_i)^2/2, which leads to the conclusion that the Surge phenomenon almost always occurs. This is somewhat unrigorous, since the assumption depends on the component index i and therefore cannot hold uniformly.
This article introduces approximation [eq:mu-approx], which can be seen as a mean-field approximation, more reasonable than point-wise assumptions. It allows for the conclusion that even if off-diagonal Hessian elements are non-negligible, the Surge phenomenon might not occur (depending on \beta_{\text{noise}}). This precision does not sacrifice simplicity.
Finally, OpenAI’s SGD analysis was from 2018, while the Surge phenomenon paper was released this year. It is surprising that it took six years to move from SGD to Adam, likely due to OpenAI’s authority and the guess in [eq:openai-adam] making people feel Adam was “solved.”
Summary
This article discussed the Scaling Law between Batch Size and learning rate from multiple perspectives, focusing on OpenAI’s derivation based on second-order loss approximation and subsequent work on the Adam optimizer.