English (unofficial) translations of posts at kexue.fm

Are the Optimal Hyperparameters for the Adam Optimizer $\beta_1 = \beta_2$?

Translated by DeepSeek V4 Pro. Translations can be inaccurate; please refer to the original post for anything important.

Recently, I came across the paper "Why Adam Works Better with \beta_1=\beta_2: The Missing Gradient Scale Invariance Principle". As the name suggests, it claims that Adam performs better when \beta_1=\beta_2. A colleague reminded me that last year’s paper "In Search of Adam’s Secret Sauce" also expressed the same view. Coincidentally, "The Effect of Mini-Batch Noise on the Implicit Bias of Adam", which was released just yesterday, also has similar findings.

\text{Adam}\color{skyblue}{\text{W}}:=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t^2\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t / (1 - \beta_1^t)\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t / (1 - \beta_2^t)\\ &\boldsymbol{u}_t = \hat{\boldsymbol{m}}_t / (\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon)\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t (\boldsymbol{u}_t \color{skyblue}{ + \lambda_t \boldsymbol{\theta}_{t-1}}) \end{aligned}\right.
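To make the update rule above concrete, here is a minimal sketch of a single AdamW step in plain Python, operating component by component on lists of floats. The function name `adamw_step` and its default hyperparameter values are my own illustrative choices, not something taken from the papers discussed here.

```python
import math

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.0):
    """One AdamW step on lists of floats; t is the 1-based step index."""
    new_theta, new_m, new_v = [], [], []
    for th, gi, mi, vi in zip(theta, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # first-moment EMA
        vi = beta2 * vi + (1 - beta2) * gi * gi   # second-moment EMA
        m_hat = mi / (1 - beta1 ** t)             # bias correction
        v_hat = vi / (1 - beta2 ** t)
        u = m_hat / (math.sqrt(v_hat) + eps)      # update amount u_t
        new_theta.append(th - lr * (u + wd * th)) # decoupled weight decay
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

# Usage: at t = 1 the bias-corrected moments are g and g^2, so u is close to sign(g).
theta, m, v = adamw_step([1.0, -2.0], [0.5, -3.0], [0.0, 0.0], [0.0, 0.0],
                         t=1, lr=0.1)
```

Setting `beta1 == beta2` in this sketch gives the configuration studied in the rest of the article.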

Numerous papers point towards \beta_1=\beta_2. What are its theoretical advantages? In this article, we will study the relevant derivations.

Online Estimation

Following chronological order, let’s first look at "In Search of Adam’s Secret Sauce". The storyline of this paper seems to be the following: experiments showed that the best result Adam achieves under the constraint \beta_1=\beta_2 is very close to the best result it achieves without that constraint. The paper then attempts to construct a theoretical explanation for this observation: when \beta_1=\beta_2=\beta, \hat{\boldsymbol{m}}_t and \hat{\boldsymbol{v}}_t can be regarded as online estimates of the first and second moments of the gradient. Specifically, by expanding \hat{\boldsymbol{m}}_t and \hat{\boldsymbol{v}}_t, we get:

\hat{\boldsymbol{m}}_t = \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} \boldsymbol{g}_k,\qquad \hat{\boldsymbol{v}}_t = \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} \boldsymbol{g}_k^2

It is easy to prove that the sum of the coefficients \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} is always equal to 1. Therefore, they are weighted averages of \boldsymbol{g}_k and \boldsymbol{g}_k^2, respectively, with the same weights, and thus carry the meaning of the first and second moments. Furthermore, we have:

\begin{aligned} \hat{\boldsymbol{v}}_t =&\, \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} (\hat{\boldsymbol{m}}_t + \boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)^2 \\ =&\, \underbrace{\frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} \hat{\boldsymbol{m}}_t^2}_{\hat{\boldsymbol{m}}_t^2} + \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} (\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)^2 + \underbrace{\frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} 2\hat{\boldsymbol{m}}_t (\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)}_{\boldsymbol{0}} \\ =&\, \hat{\boldsymbol{m}}_t^2 + \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} (\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)^2 \\ \end{aligned}

The last term clearly has the form of variance, so we might as well denote it as \hat{\boldsymbol{\sigma}}_t^2, i.e., \hat{\boldsymbol{v}}_t=\hat{\boldsymbol{m}}_t^2+\hat{\boldsymbol{\sigma}}_t^2. This is exactly the relationship between the second moment, the mean, and the variance. Note that the operations involving addition, subtraction, multiplication, division, and exponentiation of vectors mentioned above are all in the element-wise sense; that is, they are performed component by component, and the output is still a vector.
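The decomposition above is easy to check numerically. The following sketch works on a single component with an arbitrary made-up gradient sequence; the helper names `ema_weights` and `moments` are hypothetical and only serve this illustration.

```python
def ema_weights(beta, t):
    """Bias-corrected EMA weights (1-beta)/(1-beta^t) * beta^(t-k), k = 1..t."""
    norm = (1 - beta) / (1 - beta ** t)
    return [norm * beta ** (t - k) for k in range(1, t + 1)]

def moments(grads, beta):
    """Return (m_hat, v_hat, var_hat) for one component with shared beta."""
    w = ema_weights(beta, len(grads))
    m_hat = sum(wk * g for wk, g in zip(w, grads))
    v_hat = sum(wk * g * g for wk, g in zip(w, grads))
    var_hat = sum(wk * (g - m_hat) ** 2 for wk, g in zip(w, grads))
    return m_hat, v_hat, var_hat

grads = [0.3, -1.2, 0.8, 0.5, -0.1, 0.9]    # arbitrary example sequence
m_hat, v_hat, var_hat = moments(grads, beta=0.9)
weight_sum = sum(ema_weights(0.9, len(grads)))  # should be 1 up to rounding
gap = v_hat - (m_hat ** 2 + var_hat)            # should be 0 up to rounding
```

Both `weight_sum - 1` and `gap` vanish up to floating-point rounding, but only because the same weights are used for the mean and the second moment.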

Signal-to-Noise Awareness

Under these new notations, the update amount of Adam can be written as (for simplicity, assume \epsilon=0):

\boldsymbol{u}_t = \frac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t}} = \frac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{m}}_t^2+\hat{\boldsymbol{\sigma}}_t^2}} = \frac{\mathop{\mathrm{sign}}(\hat{\boldsymbol{m}}_t)}{\sqrt{1 +\hat{\boldsymbol{\sigma}}_t^2/\hat{\boldsymbol{m}}_t^2}}

This form of update has several advantages. First and most obviously, each of its components is bounded, constrained within [-1, 1], so we don’t have to worry about the update amount exploding. Secondly, \hat{\boldsymbol{\sigma}}_t^2/\hat{\boldsymbol{m}}_t^2 is exactly in the form of the reciprocal of the signal-to-noise ratio (SNR), so it can also be understood as "SNR-aware steepest descent."
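As a rough illustration of the SNR interpretation, the sketch below compares a steady gradient sequence with a noisy one for a single component, assuming \beta_1=\beta_2=\beta and \epsilon=0. The function name `snr_update` and both sequences are made up for this example.

```python
import math

def snr_update(grads, beta):
    """Return (u, snr_form) for one component with beta1 = beta2 = beta, eps = 0."""
    t = len(grads)
    norm = (1 - beta) / (1 - beta ** t)
    w = [norm * beta ** (t - k) for k in range(1, t + 1)]
    m_hat = sum(wk * g for wk, g in zip(w, grads))
    v_hat = sum(wk * g * g for wk, g in zip(w, grads))
    var_hat = v_hat - m_hat ** 2                 # weighted variance
    u = m_hat / math.sqrt(v_hat)                 # Adam's update amount
    # Same quantity written as sign(m_hat) / sqrt(1 + 1/SNR); needs m_hat != 0.
    snr_form = math.copysign(1.0, m_hat) / math.sqrt(1 + var_hat / m_hat ** 2)
    return u, snr_form

steady = [1.0, 1.1, 0.9, 1.05, 0.95]   # consistent sign, small fluctuation
noisy = [1.0, -0.9, 1.2, -1.1, 0.8]    # sign flips, large fluctuation
u_steady, f_steady = snr_update(steady, 0.9)
u_noisy, f_noisy = snr_update(noisy, 0.9)
```

In both cases the two expressions agree and stay within [-1, 1]; the steady sequence yields an update close to 1, while the noisy one is shrunk substantially.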

According to our derivation in "Steepest Descent on Manifolds: 1. SGD + Hypersphere", \mathop{\mathrm{sign}}(\hat{\boldsymbol{m}}_t) can be seen as the solution to the following optimization problem:

\max_{\boldsymbol{u}} \langle\hat{\boldsymbol{m}}_t,\boldsymbol{u}\rangle\qquad \text{s.t.}\qquad \Vert\boldsymbol{u}\Vert_{\infty} = 1

where \Vert\boldsymbol{u}\Vert_{\infty} = 1 means that the maximum absolute value of the components of \boldsymbol{u} is 1. If we consider \hat{\boldsymbol{m}}_t as a more accurate gradient, then \mathop{\mathrm{sign}}(\hat{\boldsymbol{m}}_t) is the steepest descent direction under the infinity norm. However, this bound is static. We can reasonably believe that if the gradient fluctuation is small (high SNR), then the region is relatively flat, and the update amount can be appropriately increased; otherwise, it should be reduced. Therefore, constructing a dynamic bound \frac{1}{\sqrt{1 +\hat{\boldsymbol{\sigma}}_t^2/\hat{\boldsymbol{m}}_t^2}} for each component based on the SNR better reflects the idea of adaptive learning. In that case, the steepest descent direction becomes exactly:

\max_{\boldsymbol{u}} \langle\hat{\boldsymbol{m}}_t,\boldsymbol{u}\rangle\quad \text{s.t.}\quad |\boldsymbol{u}| \leq \frac{1}{\sqrt{1 +\hat{\boldsymbol{\sigma}}_t^2/\hat{\boldsymbol{m}}_t^2}} \qquad\Rightarrow\qquad \boldsymbol{u}^* = \frac{\mathop{\mathrm{sign}}(\hat{\boldsymbol{m}}_t)}{\sqrt{1 +\hat{\boldsymbol{\sigma}}_t^2/\hat{\boldsymbol{m}}_t^2}}

Here, the absolute value | | and the inequality \leq are both element-wise.
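Because the objective and the constraint are both separable across components, the maximizer simply pushes each component to its bound with the sign of \hat{\boldsymbol{m}}_t. Here is a small sketch that compares this closed form with random feasible candidates; the bounds in `bound` are hypothetical values, not derived from any real gradient statistics.

```python
import math
import random

def box_steepest(m, bound):
    """Maximize <m, u> subject to |u_i| <= bound_i, component by component.

    For m_i = 0 every feasible u_i is optimal; copysign then picks +bound_i.
    """
    return [math.copysign(b, mi) for mi, b in zip(m, bound)]

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

m = [0.4, -1.3, 0.05]
bound = [0.9, 0.5, 0.2]          # hypothetical per-component bounds in (0, 1]
u_star = box_steepest(m, bound)

# Random feasible points should never beat the closed-form maximizer.
rng = random.Random(0)
best_random = max(
    inner(m, [rng.uniform(-b, b) for b in bound]) for _ in range(2000)
)
```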

First-order Expansion

Now let’s look at "Why Adam Works Better with \beta_1=\beta_2: The Missing Gradient Scale Invariance Principle". It treats Adam as a continuous-time ODE. However, for mini-batch optimization, gradient noise cannot be ignored. Modeling the dynamics as a continuous-time SDE might be acceptable, but an ODE, which drops the noise entirely, is not well justified. Therefore, I believe the starting point of this paper is somewhat questionable.

Following the idea of the original paper, I have made some adjustments to the proof process. We write each \boldsymbol{g}_k in \hat{\boldsymbol{v}}_t as \hat{\boldsymbol{m}}_t + (\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t), and then treat \boldsymbol{g}_k - \hat{\boldsymbol{m}}_t as a small quantity. Performing a first-order expansion, we get:

\begin{aligned} \hat{\boldsymbol{v}}_t =&\, \frac{1-\beta_2}{1-\beta_2^t}\sum_{k=1}^t \beta_2^{t-k} (\hat{\boldsymbol{m}}_t + \boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)^2 \\ \approx &\, \frac{1-\beta_2}{1-\beta_2^t}\sum_{k=1}^t \beta_2^{t-k} \hat{\boldsymbol{m}}_t^2 + \frac{1-\beta_2}{1-\beta_2^t}\sum_{k=1}^t \beta_2^{t-k} 2\hat{\boldsymbol{m}}_t (\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t) \\ = &\, \hat{\boldsymbol{m}}_t^2 + 2\hat{\boldsymbol{m}}_t \left(\frac{1-\beta_2}{1-\beta_2^t}\sum_{k=1}^t \beta_2^{t-k} \boldsymbol{g}_k - \hat{\boldsymbol{m}}_t\right) \\ \end{aligned}

We want \hat{\boldsymbol{v}}_t to be as close to \hat{\boldsymbol{m}}_t^2 as possible, so the first-order term should vanish. Since \hat{\boldsymbol{m}}_t is the \beta_1-weighted average of the gradients, this requires the \beta_2-weighted average of the gradients to coincide with it for an arbitrary gradient sequence, which leads to \beta_2 = \beta_1. Why do we want \hat{\boldsymbol{v}}_t to be close to \hat{\boldsymbol{m}}_t^2? Precisely because we hope the update amount \boldsymbol{u}_t = \hat{\boldsymbol{m}}_t/\sqrt{\hat{\boldsymbol{v}}_t} can be closer to \mathop{\mathrm{sign}}(\hat{\boldsymbol{m}}_t). Since \mathop{\mathrm{sign}} is bounded, it can better resist the perturbations caused by scale changes of \boldsymbol{g}_t, thereby improving training stability.
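The first-order term can be evaluated directly for a few values of \beta_2 while \beta_1 is held fixed. The sketch below does this for one component; the gradient sequence and the names `ema` and `first_order_term` are assumptions made for illustration.

```python
def ema(values, beta):
    """Bias-corrected exponential moving average of a sequence (k = 1..t)."""
    t = len(values)
    norm = (1 - beta) / (1 - beta ** t)
    return sum(norm * beta ** (t - k) * x for k, x in enumerate(values, start=1))

def first_order_term(grads, beta1, beta2):
    """2 * m_hat * (EMA_{beta2}(g) - m_hat), with m_hat = EMA_{beta1}(g)."""
    m_hat = ema(grads, beta1)
    return 2 * m_hat * (ema(grads, beta2) - m_hat)

grads = [1.0, 1.2, 0.7, 1.1, 0.9, 1.3, 0.8]   # arbitrary example sequence
terms = {b2: first_order_term(grads, 0.9, b2) for b2 in (0.5, 0.9, 0.99, 0.999)}
```

The term is zero at `beta2 = 0.9` and nonzero at the other values here. For one particular sequence some other \beta_2 could make it vanish by coincidence; only \beta_2=\beta_1 does so for every sequence.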

It should be noted that the proof here is greatly simplified compared to the original paper, but it captures the core idea and corrects some of its problems. The original paper first converts Adam into a continuous-time ODE, a step that itself contains certain factual errors, then applies some loose approximations, and finally obtains an expansion of the form \boldsymbol{u}_t = \mathop{\mathrm{sign}}(\boldsymbol{g}_t)(1 + \cdots). That approach is less reliable than our direct expansion at \hat{\boldsymbol{m}}_t.

Double Optimization

These two papers share a common idea, which is to make \boldsymbol{u}_t bounded in order to improve training stability. This inspired me to think in reverse: assuming \beta_1 is given, what value should \beta_2 take to make |\boldsymbol{u}_t| as small as possible? Since \boldsymbol{u}_t = \hat{\boldsymbol{m}}_t/\sqrt{\hat{\boldsymbol{v}}_t}, intuitively, we should make \hat{\boldsymbol{v}}_t as large as possible. This leads to the following double optimization problem:

\max_{\beta_2} \min_{\boldsymbol{g}_1,\cdots,\boldsymbol{g}_t}\underbrace{\frac{1-\beta_2}{1-\beta_2^t}\sum_{k=1}^t \beta_2^{t-k} \boldsymbol{g}_k^2}_{\hat{\boldsymbol{v}}_t},\qquad \text{s.t.}\qquad \frac{1-\beta_1}{1-\beta_1^t}\sum_{k=1}^t \beta_1^{t-k} \boldsymbol{g}_k = \hat{\boldsymbol{m}}_t

The notation here is not particularly rigorous; it should be understood component-wise. The \min over \boldsymbol{g}_1,\cdots,\boldsymbol{g}_t in the objective cannot be removed; it expresses that the chosen \beta_2 should be as good as possible even under the worst-case gradient sequence. This optimization problem looks complex but is actually not difficult; it can be solved layer by layer using the Cauchy-Schwarz inequality. First, solving the inner minimization problem, we have:

\sum_{k=1}^t \frac{p_k^2}{q_k}\times \sum_{k=1}^t q_k \boldsymbol{g}_k^2 \geq \left(\sum_{k=1}^t p_k \boldsymbol{g}_k\right)^2 = \hat{\boldsymbol{m}}_t^2

where p_k = \frac{1-\beta_1}{1-\beta_1^t}\beta_1^{t-k} and q_k = \frac{1-\beta_2}{1-\beta_2^t}\beta_2^{t-k}. Thus, the result of the inner minimization is \frac{\hat{\boldsymbol{m}}_t^2}{\sum_{k=1}^t p_k^2/q_k}. To maximize this, we need to minimize \sum_{k=1}^t p_k^2/q_k. Using Cauchy-Schwarz again:

\sum_{k=1}^t p_k^2/q_k = \sum_{k=1}^t q_k \times \sum_{k=1}^t p_k^2/q_k \geq \left(\sum_{k=1}^t p_k\right)^2 = 1

Equality holds when q_k=p_k, which means \beta_2 = \beta_1. Therefore, overall, taking \beta_2=\beta_1 provides the best training stability.
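The result of the inner minimization can be evaluated in closed form, which makes the conclusion easy to check numerically. In the sketch below, `worst_case_ratio` (a name I made up) computes 1/\sum_k p_k^2/q_k, that is, the smallest possible value of \hat{\boldsymbol{v}}_t/\hat{\boldsymbol{m}}_t^2 over all gradient sequences; the horizon `t = 50` and the candidate values of \beta_2 are arbitrary choices.

```python
def weights(beta, t):
    """Bias-corrected EMA weights (1-beta)/(1-beta^t) * beta^(t-k), k = 1..t."""
    norm = (1 - beta) / (1 - beta ** t)
    return [norm * beta ** (t - k) for k in range(1, t + 1)]

def worst_case_ratio(beta1, beta2, t):
    """Min over gradient sequences of v_hat / m_hat^2 = 1 / sum_k p_k^2 / q_k."""
    p, q = weights(beta1, t), weights(beta2, t)
    return 1.0 / sum(pk * pk / qk for pk, qk in zip(p, q))

t, beta1 = 50, 0.9
ratios = {b2: worst_case_ratio(beta1, b2, t) for b2 in (0.8, 0.9, 0.95, 0.999)}
# Largest |u_t| any gradient sequence can produce (eps = 0) is ratio ** -0.5.
max_abs_update = {b2: r ** -0.5 for b2, r in ratios.items()}
```

Only `beta2 = 0.9` reaches a ratio of 1, so only there is |\boldsymbol{u}_t| guaranteed to stay within 1; for the other values some gradient sequence pushes |\boldsymbol{u}_t| above 1.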

Summary

In this article, we analyzed the \beta_1, \beta_2 parameters of the Adam optimizer. From the perspective of stability, we showed that \beta_1=\beta_2 is usually a better choice, which can be understood as steepest descent under signal-to-noise ratio awareness.

When reprinting, please include the address of this article: https://kexue.fm/archives/11593
