English (unofficial) translations of posts at kexue.fm

Adaptive Learning Rate Optimizers from the Perspective of Hessian Approximation

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate; please refer to the original post for anything important.

Recently, I have been revisiting a paper from Meta last year titled "A Theory on Adam Instability in Large-Scale Machine Learning". It provides a new perspective on adaptive learning rate optimizers like Adam: it points out that the moving average of the squared gradients can, to some extent, be viewed as an estimate of the square of the Hessian matrix, so optimizers such as Adam and RMSprop are in effect approximating the second-order Newton’s method.

This perspective is quite novel and appears to differ significantly from some previous Hessian approximations, making it well worth our study and reflection.

Newton’s Method

Let the loss function be \mathcal{L}(\boldsymbol{\theta}), where the parameters to be optimized are \boldsymbol{\theta}. Our optimization goal is: \begin{equation} \boldsymbol{\theta}^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})\label{eq:loss} \end{equation} Assuming the current value of \boldsymbol{\theta} is \boldsymbol{\theta}_t, Newton’s method seeks \boldsymbol{\theta}_{t+1} by expanding the loss function to the second order: \begin{equation} \mathcal{L}(\boldsymbol{\theta})\approx \mathcal{L}(\boldsymbol{\theta}_t) + \boldsymbol{g}_t^{\top}(\boldsymbol{\theta} - \boldsymbol{\theta}_t) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_t)^{\top}\boldsymbol{\mathcal{H}}_t(\boldsymbol{\theta} - \boldsymbol{\theta}_t) \end{equation} where \boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}_t}\mathcal{L}(\boldsymbol{\theta}_t) is the gradient and \boldsymbol{\mathcal{H}}_t=\nabla_{\boldsymbol{\theta}_t}^2\mathcal{L}(\boldsymbol{\theta}_t) is the Hessian matrix. Assuming the positive definiteness of the Hessian matrix, a unique minimum exists for the right-hand side of the above equation at \boldsymbol{\theta}_t - \boldsymbol{\mathcal{H}}_t^{-1}\boldsymbol{g}_t. Newton’s method takes this as the next step \boldsymbol{\theta}_{t+1}: \begin{equation} \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t-\boldsymbol{\mathcal{H}}_t^{-1}\boldsymbol{g}_t = \boldsymbol{\theta}_t - (\nabla_{\boldsymbol{\theta}_t}^2\mathcal{L})^{-1} \nabla_{\boldsymbol{\theta}_t}\mathcal{L} \end{equation} Note that the above equation has no additional learning rate parameter; therefore, Newton’s method is inherently an adaptive learning rate algorithm. Of course, since the complexity of the Hessian matrix is proportional to the square of the number of parameters, the full Newton’s method has essentially only theoretical value in deep learning. To actually apply Newton’s method, one must make significant simplifying assumptions about the Hessian matrix, such as assuming it is a diagonal or low-rank matrix.
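To make the update concrete, here is a minimal NumPy sketch of a single Newton step on a toy quadratic loss (the toy problem and all names are illustrative, not from the original post); for a quadratic loss, one Newton step lands exactly on the optimum.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta - b^T theta,
# with gradient A @ theta - b and constant, positive definite Hessian A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

theta = np.zeros(2)
g = A @ theta - b                       # gradient at the current point
H = A                                   # Hessian at the current point
theta = theta - np.linalg.solve(H, g)   # Newton step: no learning rate needed

print(theta)                  # equals the optimum A^{-1} b for a quadratic loss
print(np.linalg.solve(A, b))
```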

From the perspective of Newton’s method, SGD assumes \boldsymbol{\mathcal{H}}_t=\eta_t^{-1}\boldsymbol{I}, while Adam assumes \boldsymbol{\mathcal{H}}_t=\eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon), where: \begin{equation} \text{Adam}:=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t\odot\boldsymbol{g}_t\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t / \left(1 - \beta_1^t\right)\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t / \left(1 - \beta_2^t\right)\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t \hat{\boldsymbol{m}}_t / \left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right) \end{aligned}\right. \end{equation} Next, we want to prove that \eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t}) is a better approximation of \boldsymbol{\mathcal{H}}_t.
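For reference, here is a minimal NumPy sketch of one Adam update following the equations above (the function signature and default hyperparameters are my own choices; \epsilon is added outside the square root, as in the formula).

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v start at zero and t starts at 1."""
    m = beta1 * m + (1 - beta1) * g          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g * g      # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```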

Gradient Approximation

The key to the proof is to use the first-order approximation of the gradient: \begin{equation} \boldsymbol{g}_{\boldsymbol{\theta}} \approx \boldsymbol{g}_{\boldsymbol{\theta}^*} + \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*) \end{equation} where \boldsymbol{g}_{\boldsymbol{\theta}^*} and \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*} indicate that we are expanding at \boldsymbol{\theta}=\boldsymbol{\theta}^*. Here \boldsymbol{\theta}^* is the target we are seeking in Eq. \eqref{eq:loss}, where the gradient of the model is zero, so the above equation simplifies to: \begin{equation} \boldsymbol{g}_{\boldsymbol{\theta}} \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*) \end{equation} Thus: \begin{equation} \boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top} \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}^*)^{\top}\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top} \end{equation} Assume that once training is "on track," the model stays in a state of "circling" around \boldsymbol{\theta}^*, converging slowly in a spiral. To some extent, we can then treat \boldsymbol{\theta} - \boldsymbol{\theta}^* as a random variable following a normal distribution \mathcal{N}(\boldsymbol{0},\sigma^2\boldsymbol{I}). Then: \begin{equation} \mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top}] \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}\mathbb{E}[(\boldsymbol{\theta} - \boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}^*)^{\top}]\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top} = \sigma^2\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top}\label{eq:hessian-2} \end{equation} Assuming the Hessian matrix is diagonal, we can retain only the diagonal elements of the above equation: \begin{equation} \text{diag}(\mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}]) \approx \sigma^2\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^2\quad\Rightarrow\quad \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*} = \frac{1}{\sigma}\text{diag}(\sqrt{\mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}]}) \end{equation} Does this look familiar? Adam’s \hat{\boldsymbol{v}}_t is a moving average of the squared gradients, which can be seen as an approximation of \mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}]. Finally, assuming that \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}_t} does not change much compared to \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}, we conclude that \eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t}) is an approximation of \boldsymbol{\mathcal{H}}_t, with the factor 1/\sigma absorbed into the learning rate \eta_t.
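The identity \text{diag}(\mathbb{E}[\boldsymbol{g}\odot\boldsymbol{g}]) \approx \sigma^2\boldsymbol{\mathcal{H}}^2 is easy to check numerically for a quadratic loss with a diagonal Hessian. The sketch below (a toy setup of my own, not from the paper) samples \boldsymbol{\theta} - \boldsymbol{\theta}^* from \mathcal{N}(\boldsymbol{0},\sigma^2\boldsymbol{I}) and recovers the Hessian diagonal from the mean squared gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

h = np.array([4.0, 1.0, 0.25])   # diagonal of the true Hessian at theta*
sigma = 0.1                      # spread of theta around theta*

# For L(theta) = 0.5 * (theta - theta*)^T diag(h) (theta - theta*),
# the gradient is g = h * (theta - theta*).
deltas = sigma * rng.standard_normal((100_000, 3))   # samples of theta - theta*
grads = h * deltas

hess_diag_estimate = np.sqrt((grads ** 2).mean(axis=0)) / sigma
print(hess_diag_estimate)   # ~ [4.0, 1.0, 0.25]: diag(H) up to sampling noise
```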

This also explains why Adam’s \beta_2 is usually greater than \beta_1. To estimate the Hessian more accurately, the moving average that produces \hat{\boldsymbol{v}}_t should be as "long-term" as possible (close to a uniform average), so \beta_2 should be very close to 1. The momentum \hat{\boldsymbol{m}}_t, on the other hand, is a moving average of the gradient itself: if that average is too long-term, it approaches \boldsymbol{g}_{\boldsymbol{\theta}^*}=\boldsymbol{0}, which is counterproductive. Therefore, the moving average for the momentum should be more local.

More Connections

In the previous derivation, we assumed \boldsymbol{\theta}^* is the theoretical optimum, so \boldsymbol{g}_{\boldsymbol{\theta}^*} = \boldsymbol{0}. What if \boldsymbol{\theta}^* is an arbitrary point? Then Eq. \eqref{eq:hessian-2} becomes: \begin{equation} \mathbb{E}[(\boldsymbol{g}_{\boldsymbol{\theta}}-\boldsymbol{g}_{\boldsymbol{\theta}^*})(\boldsymbol{g}_{\boldsymbol{\theta}}-\boldsymbol{g}_{\boldsymbol{\theta}^*})^{\top}] \approx \sigma^2\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top} \end{equation} In other words, as long as we take a moving average of the covariance of the gradient rather than its raw second moment, we can obtain a local Hessian approximation. This corresponds exactly to the approach of the AdaBelief optimizer, whose \boldsymbol{v} is a moving average of the squared difference between \boldsymbol{g} and \boldsymbol{m}: \begin{equation} \text{AdaBelief}:=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) (\boldsymbol{g}_t - \boldsymbol{m}_t)\odot(\boldsymbol{g}_t - \boldsymbol{m}_t)\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t / \left(1 - \beta_1^t\right)\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t / \left(1 - \beta_2^t\right)\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t \hat{\boldsymbol{m}}_t / \left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right) \end{aligned}\right. \end{equation}
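For comparison with the Adam sketch above, here is the same kind of minimal NumPy sketch for the AdaBelief update as written in these equations (illustrative only; the official implementation differs in small details such as where \epsilon enters).

```python
import numpy as np

def adabelief_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief update; m and v start at zero and t starts at 1."""
    m = beta1 * m + (1 - beta1) * g               # moving average of the gradient
    diff = g - m                                  # deviation of g from the momentum
    v = beta2 * v + (1 - beta2) * diff * diff     # centered ("belief") second moment
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```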

Conclusion

This article introduced a perspective on adaptive learning rate optimizers like Adam from the viewpoint of Newton’s method and Hessian approximation, and discussed related results on Hessian approximation.

Reprinting: Please include the original address of this article: https://kexue.fm/archives/10588

For further details on reprinting, please refer to the "Scientific Space FAQ".