English (unofficial) translations of posts at kexue.fm

Adaptive Learning Rate Optimizers from the Perspective of Hessian Approximation

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate; please refer to the original post for anything important.

Recently, I have been revisiting a paper from Meta last year titled "A Theory on Adam Instability in Large-Scale Machine Learning". It provides a new perspective on adaptive learning rate optimizers like Adam: it points out that the moving average of the squared gradients can, to some extent, be viewed as an estimate of the square of the Hessian matrix, so optimizers such as Adam and RMSprop are in effect approximating the second-order Newton’s method.

This perspective is quite novel and appears to differ significantly from some previous Hessian approximations, making it well worth our study and reflection.

Newton’s Method

Let the loss function be \mathcal{L}(\boldsymbol{\theta}), where the parameters to be optimized are \boldsymbol{\theta}. Our optimization goal is: \begin{equation} \boldsymbol{\theta}^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})\label{eq:loss} \end{equation} Assuming the current value of \boldsymbol{\theta} is \boldsymbol{\theta}_t, Newton’s method seeks \boldsymbol{\theta}_{t+1} by expanding the loss function to the second order: \begin{equation} \mathcal{L}(\boldsymbol{\theta})\approx \mathcal{L}(\boldsymbol{\theta}_t) + \boldsymbol{g}_t^{\top}(\boldsymbol{\theta} - \boldsymbol{\theta}_t) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_t)^{\top}\boldsymbol{\mathcal{H}}_t(\boldsymbol{\theta} - \boldsymbol{\theta}_t) \end{equation} where \boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}_t}\mathcal{L}(\boldsymbol{\theta}_t) is the gradient and \boldsymbol{\mathcal{H}}_t=\nabla_{\boldsymbol{\theta}_t}^2\mathcal{L}(\boldsymbol{\theta}_t) is the Hessian matrix. Assuming the positive definiteness of the Hessian matrix, a unique minimum exists for the right-hand side of the above equation at \boldsymbol{\theta}_t - \boldsymbol{\mathcal{H}}_t^{-1}\boldsymbol{g}_t. Newton’s method takes this as the next step \boldsymbol{\theta}_{t+1}: \begin{equation} \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t-\boldsymbol{\mathcal{H}}_t^{-1}\boldsymbol{g}_t = \boldsymbol{\theta}_t - (\nabla_{\boldsymbol{\theta}_t}^2\mathcal{L})^{-1} \nabla_{\boldsymbol{\theta}_t}\mathcal{L} \end{equation} Note that the above equation has no additional learning rate parameter; therefore, Newton’s method is inherently an adaptive learning rate algorithm. Of course, since the complexity of the Hessian matrix is proportional to the square of the number of parameters, the full Newton’s method has essentially only theoretical value in deep learning. To actually apply Newton’s method, one must make significant simplifying assumptions about the Hessian matrix, such as assuming it is a diagonal or low-rank matrix.
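To make the update concrete, here is a minimal NumPy sketch of a single Newton step on a toy quadratic loss (the toy problem and all names are illustrative, not from the original post); for a quadratic loss, one Newton step lands exactly on the optimum.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta - b^T theta,
# with gradient A @ theta - b and constant, positive definite Hessian A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

theta = np.zeros(2)
g = A @ theta - b                       # gradient at the current point
H = A                                   # Hessian at the current point
theta = theta - np.linalg.solve(H, g)   # Newton step: no learning rate needed

print(theta)                  # equals the optimum A^{-1} b for a quadratic loss
print(np.linalg.solve(A, b))
```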

From the perspective of Newton’s method, SGD assumes \boldsymbol{\mathcal{H}}_t=\eta_t^{-1}\boldsymbol{I}, while Adam assumes \boldsymbol{\mathcal{H}}_t=\eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon), where: \begin{equation} \text{Adam}:=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t\odot\boldsymbol{g}_t\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t / \left(1 - \beta_1^t\right)\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t / \left(1 - \beta_2^t\right)\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t \hat{\boldsymbol{m}}_t / \left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right) \end{aligned}\right. \end{equation} Next, we want to prove that \eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t}) is a better approximation of \boldsymbol{\mathcal{H}}_t.
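For reference, here is a minimal NumPy sketch of one Adam update following the equations above (the function signature and default hyperparameters are my own choices; \epsilon is added outside the square root, as in the formula).

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v start at zero and t starts at 1."""
    m = beta1 * m + (1 - beta1) * g          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g * g      # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```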

Gradient Approximation

The key to the proof is to use the first-order approximation of the gradient: \begin{equation} \boldsymbol{g}_{\boldsymbol{\theta}} \approx \boldsymbol{g}_{\boldsymbol{\theta}^*} + \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*) \end{equation} where \boldsymbol{g}_{\boldsymbol{\theta}^*} and \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*} indicate that we are expanding at \boldsymbol{\theta}=\boldsymbol{\theta}^*. Here \boldsymbol{\theta}^* is the target we are seeking in Eq. \eqref{eq:loss}, where the gradient of the model is zero, so the above equation simplifies to: \begin{equation} \boldsymbol{g}_{\boldsymbol{\theta}} \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*) \end{equation} Thus: \begin{equation} \boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top} \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}^*)^{\top}\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top} \end{equation} Assume that once training is "on track," the model stays in a state of "circling" around \boldsymbol{\theta}^*, converging slowly in a spiral. To some extent, we can then treat \boldsymbol{\theta} - \boldsymbol{\theta}^* as a random variable following a normal distribution \mathcal{N}(\boldsymbol{0},\sigma^2\boldsymbol{I}). Then: \begin{equation} \mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top}] \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}\mathbb{E}[(\boldsymbol{\theta} - \boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}^*)^{\top}]\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top} = \sigma^2\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top}\label{eq:hessian-2} \end{equation} Assuming the Hessian matrix is diagonal, we can retain only the diagonal elements of the above equation: \begin{equation} \text{diag}(\mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}]) \approx \sigma^2\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^2\quad\Rightarrow\quad \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*} = \frac{1}{\sigma}\text{diag}(\sqrt{\mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}]}) \end{equation} Does this look familiar? Adam’s \hat{\boldsymbol{v}}_t is a moving average of the squared gradients, which can be seen as an approximation of \mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}]. Finally, assuming that \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}_t} does not change much compared to \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}, we conclude that \eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t}) is an approximation of \boldsymbol{\mathcal{H}}_t, with the factor 1/\sigma absorbed into the learning rate \eta_t.
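The identity \text{diag}(\mathbb{E}[\boldsymbol{g}\odot\boldsymbol{g}]) \approx \sigma^2\boldsymbol{\mathcal{H}}^2 is easy to check numerically for a quadratic loss with a diagonal Hessian. The sketch below (a toy setup of my own, not from the paper) samples \boldsymbol{\theta} - \boldsymbol{\theta}^* from \mathcal{N}(\boldsymbol{0},\sigma^2\boldsymbol{I}) and recovers the Hessian diagonal from the mean squared gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

h = np.array([4.0, 1.0, 0.25])   # diagonal of the true Hessian at theta*
sigma = 0.1                      # spread of theta around theta*

# For L(theta) = 0.5 * (theta - theta*)^T diag(h) (theta - theta*),
# the gradient is g = h * (theta - theta*).
deltas = sigma * rng.standard_normal((100_000, 3))   # samples of theta - theta*
grads = h * deltas

hess_diag_estimate = np.sqrt((grads ** 2).mean(axis=0)) / sigma
print(hess_diag_estimate)   # ~ [4.0, 1.0, 0.25]: diag(H) up to sampling noise
```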

This also explains why Adam’s \beta_2 is usually greater than \beta_1. To estimate the Hessian more accurately, the moving average that produces \hat{\boldsymbol{v}}_t should be as "long-term" as possible (close to a uniform average), so \beta_2 should be very close to 1. The momentum \hat{\boldsymbol{m}}_t, on the other hand, is a moving average of the gradient itself: if that average is too long-term, it approaches \boldsymbol{g}_{\boldsymbol{\theta}^*}=\boldsymbol{0}, which is counterproductive. Therefore, the moving average for the momentum should be more local.

More Connections

In the previous derivation, we assumed \boldsymbol{\theta}^* is the theoretical optimum, so \boldsymbol{g}_{\boldsymbol{\theta}^*} = \boldsymbol{0}. What if \boldsymbol{\theta}^* is an arbitrary point? Then Eq. \eqref{eq:hessian-2} becomes: \begin{equation} \mathbb{E}[(\boldsymbol{g}_{\boldsymbol{\theta}}-\boldsymbol{g}_{\boldsymbol{\theta}^*})(\boldsymbol{g}_{\boldsymbol{\theta}}-\boldsymbol{g}_{\boldsymbol{\theta}^*})^{\top}] \approx \sigma^2\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top} \end{equation} In other words, as long as we take a moving average of the covariance of the gradient rather than its raw second moment, we can obtain a local Hessian approximation. This corresponds exactly to the approach of the AdaBelief optimizer, whose \boldsymbol{v} is a moving average of the squared difference between \boldsymbol{g} and \boldsymbol{m}: \begin{equation} \text{AdaBelief}:=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) (\boldsymbol{g}_t - \boldsymbol{m}_t)\odot(\boldsymbol{g}_t - \boldsymbol{m}_t)\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t / \left(1 - \beta_1^t\right)\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t / \left(1 - \beta_2^t\right)\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t \hat{\boldsymbol{m}}_t / \left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right) \end{aligned}\right. \end{equation}
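For comparison with the Adam sketch above, here is the same kind of minimal NumPy sketch for the AdaBelief update as written in these equations (illustrative only; the official implementation differs in small details such as where \epsilon enters).

```python
import numpy as np

def adabelief_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief update; m and v start at zero and t starts at 1."""
    m = beta1 * m + (1 - beta1) * g               # moving average of the gradient
    diff = g - m                                  # deviation of g from the momentum
    v = beta2 * v + (1 - beta2) * diff * diff     # centered ("belief") second moment
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```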

Conclusion

This article introduced a perspective on adaptive learning rate optimizers like Adam from the viewpoint of Newton’s method and Hessian approximation, and discussed related results on Hessian approximation.

Reprinting: Please include the original address of this article: https://kexue.fm/archives/10588

For further details on reprinting, please refer to the "Scientific Space FAQ".