English (unofficial) translations of posts at kexue.fm

Making “Alchemy” More Scientific (V): Fine-tuning Learning Rates Based on Gradients

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate; please refer to the original post for anything important.

In the previous four articles, we explored a series of convergence conclusions for SGD, ranging from bounded to unbounded domains and from average loss to final loss. Some readers might feel that, after all this discussion, we are still just talking about SGD; perhaps these are results from the “ancient era”? Not at all! For instance, the core identity relied upon in the fourth article, “Making Alchemy More Scientific (IV): New Identity, New Learning Rate”, is as recent as 2023. The conclusions in the third article, “Making Alchemy More Scientific (III): SGD’s Final Loss Convergence”, are only slightly older, dating from 2020.

Also in the fourth article, we derived the commonly used “linear decay” learning rate strategy, demonstrating that this series of theoretical derivations is not merely “armchair theorizing” but can provide effective guidance for practice. Next, we will discuss more refined learning rate strategies based on gradients. This helps us understand the principles of learning rate scheduling and serves as the foundation for various adaptive learning rate optimizers.

The Initial Starting Point

If we carefully review the previous proofs, we find that the starting point for this series of conclusions is a seemingly unremarkable identity: \begin{equation} \begin{aligned} \Vert\boldsymbol{\theta}_{t+1} - \boldsymbol{\varphi}\Vert^2 &= \Vert\boldsymbol{\theta}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)- \boldsymbol{\varphi}\Vert^2 \\ &= \Vert\boldsymbol{\theta}_t - \boldsymbol{\varphi}\Vert^2 - 2\eta_t (\boldsymbol{\theta}_t- \boldsymbol{\varphi})\cdot\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) + \eta_t^2\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \end{aligned} \label{eq:begin} \end{equation} The reason it is called “unremarkable” is that it is so trivial; any competent high school student could understand and prove it. Yet, such a simple identity supports a whole series of conclusions in stochastic optimization, making one marvel at how the greatest truths are often the simplest.
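As a quick sanity check, the identity can be verified numerically for arbitrary vectors. The snippet below is a minimal sketch (NumPy, with purely illustrative names) confirming that both sides agree for a random \boldsymbol{\theta}_t, \boldsymbol{\varphi}, update direction, and learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_t = rng.normal(size=5)      # current parameters θ_t
phi     = rng.normal(size=5)      # arbitrary reference point φ
g       = rng.normal(size=5)      # arbitrary update direction g(x_t, θ_t)
eta_t   = 0.1                     # learning rate η_t

theta_next = theta_t - eta_t * g  # SGD-style update θ_{t+1} = θ_t - η_t g

lhs = np.sum((theta_next - phi) ** 2)
rhs = (np.sum((theta_t - phi) ** 2)
       - 2 * eta_t * np.dot(theta_t - phi, g)
       + eta_t ** 2 * np.sum(g ** 2))
assert np.isclose(lhs, rhs)       # the identity holds for any φ and any g
```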

To understand this identity, we should note two aspects of “arbitrariness.” First, \boldsymbol{\varphi} is arbitrary. In most cases, we set it to the theoretical optimal solution \boldsymbol{\theta}^* to obtain the most valuable results, but this does not change the inherent arbitrariness of \boldsymbol{\varphi}. We already know that one of the keys to the derivations in the third and fourth articles was substituting \boldsymbol{\varphi}=\boldsymbol{\theta}_{T-k} and \boldsymbol{\varphi}=\boldsymbol{\theta}_k.

Second, \boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) is arbitrary. At first glance, this might be surprising because previous articles assumed \boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) to be the gradient of the loss function \nabla_{\boldsymbol{\theta}_t}L(\boldsymbol{x}_t,\boldsymbol{\theta}_t). However, in reality, the only purpose of setting it as the gradient is to combine it with the convexity of L to obtain: \begin{equation} L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi}) \leq (\boldsymbol{\theta}_t- \boldsymbol{\varphi})\cdot\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) \end{equation} thereby linking the identity to the loss function. If we do not need this point, or have other ways to achieve it, then it is not strictly necessary. This arbitrariness is crucial; it allows us to construct more complex update rules and is the key to extending results to non-SGD optimizers later on.
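Incidentally, the inequality used above is nothing more than the first-order characterization of convexity applied to L(\boldsymbol{x}_t,\cdot): \begin{equation} L(\boldsymbol{x}_t,\boldsymbol{\varphi}) \geq L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) + (\boldsymbol{\varphi} - \boldsymbol{\theta}_t)\cdot\nabla_{\boldsymbol{\theta}_t}L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) \end{equation} which, upon rearranging and identifying \boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) with the gradient, is exactly the displayed inequality.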

Classical Results

Rearranging Equation [eq:begin] slightly gives: \begin{equation} 2\eta_t (\boldsymbol{\theta}_t- \boldsymbol{\varphi})\cdot\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)= \Vert\boldsymbol{\theta}_t - \boldsymbol{\varphi}\Vert^2 - \Vert\boldsymbol{\theta}_{t+1} - \boldsymbol{\varphi}\Vert^2 + \eta_t^2\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \label{eq:begin-2} \end{equation} The “old-school” approach is to first divide both sides by 2\eta_t and then sum over t from 1 to T. The advantage is that the left side can directly establish an inequality relationship with the loss through convexity. The disadvantage is that it requires assuming a bounded domain and a non-increasing learning rate to easily bound the right side. So far, we have only used the “old-school” method in the first article, “Making Alchemy More Scientific (I): Average Loss Convergence of SGD”, with the final result being: \begin{equation} \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t,\boldsymbol{\theta}^*)\leq \frac{R^2}{2T\eta_T} + \frac{1}{2T}\sum_{t=1}^T\eta_t\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \label{leq:old-avg} \end{equation} This time, we explicitly keep \Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert without assuming \Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert\leq G for further simplification. Since we assume the learning rate is non-increasing, we have: \begin{equation} \frac{R^2}{2T\eta_T} + \frac{1}{2T}\sum_{t=1}^T\eta_t\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2\geq \frac{R^2}{2T\eta_T} + \frac{\eta_T}{2T}\sum_{t=1}^T\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \geq \frac{R}{T}\sqrt{\sum_{t=1}^T\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2} \label{leq:old-avg-optimal} \end{equation} Let V_t = \sum_{k=1}^t\Vert\boldsymbol{g}(\boldsymbol{x}_k,\boldsymbol{\theta}_k)\Vert^2. The condition for equality is \eta_1 = \eta_2 = \cdots = \eta_T = R/\sqrt{V_T}. That is, the right side of inequality [leq:old-avg] reaches its minimum at a constant learning rate R/\sqrt{V_T}, which is the fastest-converging learning rate. However, this result violates causality; at time t, we cannot “foresee” the gradient magnitudes of future steps. One way to modify it to comply with causality is: \begin{equation} \eta_t = \frac{R}{\sqrt{V_t}} = \frac{R}{\sqrt{\sum_{k=1}^t\Vert\boldsymbol{g}(\boldsymbol{x}_k,\boldsymbol{\theta}_k)\Vert^2}} \label{eq:adagrad-mini} \end{equation} After this modification, we need to re-prove the bound: \begin{equation} \sum_{t=1}^T\eta_t\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 = R\sum_{t=1}^T\frac{V_t - V_{t-1}}{\sqrt{V_t}}\leq 2R\sum_{t=1}^T\frac{V_t - V_{t-1}}{\sqrt{V_t} + \sqrt{V_{t-1}}} = 2R\sum_{t=1}^T (\sqrt{V_t} - \sqrt{V_{t-1}}) = 2R\sqrt{V_T} \end{equation} Substituting this back into inequality [leq:old-avg] yields: \begin{equation} \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t,\boldsymbol{\theta}^*)\leq \frac{3R}{2T}\sqrt{\sum_{t=1}^T\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2} \end{equation} The result is only 50% larger than the ideal case in [leq:old-avg-optimal], which is quite good. Crucially, it does not violate causality and is a practically feasible learning rate strategy. This is the prototype of the AdaGrad optimizer. 
By applying the form of Equation [eq:adagrad-mini] element-wise to each component, we obtain the standard version of AdaGrad, which we will return to later.
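To make this concrete, here is a minimal coordinate-wise sketch of the rule in Equation [eq:adagrad-mini] (illustrative names; R plays the role of the base learning rate, and the small eps is only a numerical guard against division by zero):

```python
import numpy as np

def adagrad_step(theta, grad, v_accum, R=1.0, eps=1e-10):
    """One coordinate-wise step of η_t = R / sqrt(V_t), the AdaGrad-style rule.

    v_accum holds the per-coordinate running sum of squared gradients V_t.
    """
    v_accum += grad ** 2                          # V_t = V_{t-1} + g_t^2, per coordinate
    theta -= R / (np.sqrt(v_accum) + eps) * grad  # θ_{t+1} = θ_t - η_t ⊙ g_t
    return theta, v_accum

# Usage: initialize v_accum = np.zeros_like(theta) and call adagrad_step once per iteration.
```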

Be Careful with Expectations

Starting from the second article, “Making Alchemy More Scientific (II): Extending Conclusions to Unbounded Domains”, we used a “new-school” processing method, which sums both sides of Equation [eq:begin-2] directly: \begin{equation} \begin{aligned} \sum_{t=1}^T 2\eta_t (\boldsymbol{\theta}_t- \boldsymbol{\varphi})\cdot\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) &= \Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2 - \Vert\boldsymbol{\theta}_{T+1} - \boldsymbol{\varphi}\Vert^2 + \sum_{t=1}^T \eta_t^2\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \\ &\leq \Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2 + \sum_{t=1}^T \eta_t^2\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \end{aligned} \end{equation} Clearly, the advantage of the “new-school” method is that it does not assume monotonicity of the learning rate, nor does it require a bounded domain; the term \Vert\boldsymbol{\theta}_t - \boldsymbol{\varphi}\Vert^2 on the right side cancels out naturally. The cost is that the sum on the left side has an additional weight \eta_t. This weight is not a major issue for SGD, but for adaptive learning rate optimizers, it is almost fatal. Thus, despite the many subtleties of the new-school method, it almost never extends to adaptive learning rate optimizers, at which point we must revert to the “old-school” method.

Without straying too far, let’s return to the analysis of SGD. Using convexity on the left side of the above equation gives: \begin{equation} \sum_{t=1}^T \eta_t [L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})]\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2}{2} + \frac{1}{2}\sum_{t=1}^T \eta_t^2\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \end{equation} In the first three articles, the step immediately following this was to take the mathematical expectation \mathbb{E} on both sides. But here we must be very careful! The expectation is taken over all \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_T. In the first three articles, we assumed the learning rate \eta_t was independent of the data, so \mathbb{E}[\eta_t [L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})]] = \eta_t\mathbb{E} [L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})] = \eta_t\mathbb{E} [L(\boldsymbol{\theta}_t) - L(\boldsymbol{\varphi})], establishing the link between the left side and the target loss. However, in this article, we are considering gradient-dependent learning rates, so \eta_t cannot be simply factored out.

A relatively simple remedy is to assume \eta_t depends at most on \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_{t-1}. In this case, the expectation over \boldsymbol{x}_t can be taken separately, e.g., \mathbb{E}[\eta_t L(\boldsymbol{x}_t,\boldsymbol{\theta}_t)] = \mathbb{E}[\eta_t \mathbb{E}_{\boldsymbol{x}_t}[L(\boldsymbol{x}_t,\boldsymbol{\theta}_t)]] = \mathbb{E}[\eta_t L(\boldsymbol{\theta}_t)]. But even this is somewhat awkward to work with, so we might as well simply assume \eta_t is data-independent. Does this mean we cannot implement gradient-adjusted learning rates? Not at all; it just means we can only implement adjustments based on the expected gradient magnitude G_t = \sqrt{\mathbb{E}[\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2]}, i.e., \begin{equation} \sum_{t=1}^T \eta_t \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\varphi})]\leq \frac{\mathbb{E}[\Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2]}{2} + \frac{1}{2}\sum_{t=1}^T \eta_t^2\underbrace{\mathbb{E}[\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2]}_{G_t^2} \label{leq:E-L} \end{equation}

Still the Old Routine

With inequality [leq:E-L], subsequent derivations are relatively straightforward. We will generalize the conclusions from the second, third, and fourth articles one by one to see what new insights they offer for learning rate adjustment. The first conclusion of the second article, “Making Alchemy More Scientific (II): Extending Conclusions to Unbounded Domains”, assumed non-increasing learning rates. Replacing \eta_t on the left with \eta_T and substituting \boldsymbol{\varphi}=\boldsymbol{\theta}^* gives: \begin{equation} \frac{1}{T}\sum_{t=1}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)] \leq \frac{R^2}{2T\eta_T} + \frac{1}{2T\eta_T}\sum_{t=1}^T \eta_t^2 G_t^2 \end{equation} where R = \Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert. Since the learning rate is non-increasing, it is easy to prove that the right side is minimized under a constant learning rate, where: \begin{equation} \eta_t = \frac{R}{ \sqrt{\sum_{k=1}^T G_k^2}} \end{equation} Of course, this result also violates causality, so we can only modify it to: \begin{equation} \eta_t = \frac{R}{\sqrt{\sum_{k=1}^t G_k^2}} = \frac{R}{ \sqrt{\sum_{k=1}^t \mathbb{E}[\Vert\boldsymbol{g}(\boldsymbol{x}_k,\boldsymbol{\theta}_k)\Vert^2]}} \end{equation} However, the \mathbb{E} in the denominator is also unrealistic; it would mean running the process many times to take an average. We simply remove \mathbb{E} as an approximation, or assume that the variance of \Vert\boldsymbol{g}(\boldsymbol{x}_k,\boldsymbol{\theta}_k)\Vert^2 is very small, so a single sample is accurate enough. In subsequent results, we default to this operation. In short, after removing the expectation \mathbb{E}, the result is essentially the same as [eq:adagrad-mini], offering no new insights for now.

Inverse Gradient

The second conclusion of “Making Alchemy More Scientific (II): Extending Conclusions to Unbounded Domains” is a weighted average form of the inequality: \begin{equation} \frac{\sum_{t=1}^T \eta_t \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)]}{\sum_{t=1}^T \eta_t}\leq \frac{R^2}{2\sum_{t=1}^T \eta_t} + \frac{\sum_{t=1}^T \eta_t^2 G_t^2}{2\sum_{t=1}^T \eta_t} \label{leq:avg-weighted} \end{equation} Using the Cauchy-Schwarz inequality: \begin{equation} \sum_{t=1}^T \eta_t^2 G_t^2 = \frac{\sum_{t=1}^T (\eta_t G_t)^2 \sum_{t=1}^T G_t^{-2}}{\sum_{t=1}^T G_t^{-2}} \geq \frac{\left(\sum_{t=1}^T \eta_t G_t G_t^{-1}\right)^2}{\sum_{t=1}^T G_t^{-2}} = \frac{\left(\sum_{t=1}^T \eta_t\right)^2}{\sum_{t=1}^T G_t^{-2}} \end{equation} The condition for equality is \eta_t G_t \propto G_t^{-1}, i.e., \eta_t \propto G_t^{-2}. Substituting this into the right side of [leq:avg-weighted], we have: \begin{equation} \frac{R^2}{2\sum_{t=1}^T \eta_t} + \frac{\sum_{t=1}^T \eta_t^2 G_t^2}{2\sum_{t=1}^T \eta_t} \geq \frac{R^2}{2\sum_{t=1}^T \eta_t} + \frac{\sum_{t=1}^T \eta_t}{2\sum_{t=1}^T G_t^{-2}}\geq \frac{R}{\sqrt{\sum_{t=1}^T G_t^{-2}}} \end{equation} The total condition for equality is \eta_t = R G_t^{-2}/\sqrt{Q_T}, where Q_t = \sum_{k=1}^t G_k^{-2}. The most unique aspect of this result is that it tells us the learning rate should be inversely proportional to the square of the gradient magnitude. This can be used to explain the necessity of Warmup, as gradient magnitudes are larger at the beginning of training and then decrease, remaining approximately constant for a long time. Modifying it to a causal form gives: \begin{equation} \eta_t = \frac{R G_t^{-2}}{\sqrt{Q_t}} = \frac{R G_t^{-2}}{\sqrt{\sum_{k=1}^t G_k^{-2}}} \end{equation} After modification, we must re-prove the bounds: \begin{gather} \sum_{t=1}^T \eta_t = \sum_{t=1}^T \frac{R(Q_t - Q_{t-1})}{\sqrt{Q_t}} \geq \sum_{t=1}^T \frac{R(Q_t - Q_{t-1})}{\sqrt{Q_t} + \sqrt{Q_{t-1}}} = \sum_{t=1}^T R(\sqrt{Q_t} - \sqrt{Q_{t-1}}) = R \sqrt{Q_T} \\ \sum_{t=1}^T \eta_t^2 G_t^2 = \sum_{t=1}^T \frac{R^2(Q_t - Q_{t-1})}{Q_t} = R^2 + \sum_{t=2}^T \frac{R^2(Q_t - Q_{t-1})}{Q_t} \leq R^2 + R^2\sum_{t=2}^T \ln \frac{Q_t}{Q_{t-1}}= R^2+ R^2\ln \frac{Q_T}{Q_1} \end{gather} Substituting these two results into [leq:avg-weighted] yields: \begin{equation} \frac{\sum_{t=1}^T \eta_t \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)]}{\sum_{t=1}^T \eta_t}\leq \frac{R}{\sqrt{Q_T}}\left(1 + \frac{1}{2}\ln \frac{Q_T}{Q_1}\right) \end{equation} The core difference from the optimal solution is the addition of a logarithmically growing factor \ln (Q_T/Q_1), which is a common phenomenon when changing from a static to a dynamic learning rate.
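To see how \eta_t \propto G_t^{-2} produces a warmup-like curve, here is a small sketch with a purely synthetic gradient-norm profile (large at the start, then roughly constant); it computes both the ideal schedule and its causal modification (names are illustrative):

```python
import numpy as np

T, R = 1000, 1.0
t = np.arange(1, T + 1)
G = 1.0 + 5.0 * np.exp(-t / 50.0)          # synthetic G_t: large gradient norms early, roughly constant later

Q = np.cumsum(G ** -2)                     # Q_t = sum_{k<=t} G_k^{-2}
eta_ideal  = R * G ** -2 / np.sqrt(Q[-1])  # η_t ∝ G_t^{-2}: small while gradients are large, then levels off (warmup)
eta_causal = R * G ** -2 / np.sqrt(Q)      # causal modification η_t = R G_t^{-2} / sqrt(Q_t)
```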

Focus on the Present

Starting from the third article, “Making Alchemy More Scientific (III): SGD’s Final Loss Convergence”, our conclusions began to focus on final loss convergence. Generalizing the core conclusion of that article to dynamic gradients results in: \begin{equation} \mathbb{E}[L(\boldsymbol{\theta}_T) - L(\boldsymbol{\theta}^*)] \leq \frac{R^2}{2T\eta_T} + \frac{1}{2\eta_T}\sum_{t=1}^{T}\frac{\eta_t^2 G_t^2}{\max(1,\,T-t)} \label{leq:last-1} \end{equation} Since this result also requires the learning rate to be non-increasing, it is easy to prove that the minimum of the right side is R\sqrt{V_T/T}, achieved at a constant learning rate: \begin{equation} \eta_t = \frac{R}{\sqrt{T V_T}},\qquad V_t = \sum_{k=1}^t\frac{G_k^2}{\max(1,\,t-k)} \end{equation} This learning rate is also very interesting. It has a denominator similar to the “mini” AdaGrad learning rate [eq:adagrad-mini], where V_t is a sum of squared gradients. The difference here is that the gradient at time k is weighted by 1/\max(1,\,t-k), meaning it focuses more on the current gradient. This is a new feature brought by moving from average loss to final loss, and it is very close to the approach of RMSProp, Adam, etc., which update the second moment via EMA.

Intuitively, the causal version would be \eta_t = R/\sqrt{t V_t}, but this is not precise enough. The correct version should be \eta_t = R/\sqrt{T V_t}. Substituting this into the right side of Equation [leq:last-1] gives: \begin{equation} \begin{aligned} \frac{R^2}{2T\eta_T} + \frac{1}{2\eta_T}\sum_{t=1}^{T}\frac{\eta_t^2 G_t^2}{\max(1,\,T-t)} &= \frac{R}{2}\sqrt{\frac{V_T}{T}}\left(1 + \sum_{t=1}^{T}\frac{G_t^2/V_t}{\max(1,\,T-t)}\right) \\ &\leq \frac{R}{2}\sqrt{\frac{V_T}{T}}\left(1 + \sum_{t=1}^{T}\frac{1}{\max(1,\,T-t)}\right) \\ &\leq \frac{R}{2}\sqrt{\frac{V_T}{T}} (3 + \ln T) \end{aligned} \end{equation} Here we used G_t^2\leq V_t, together with \sum_{t=1}^{T} 1/\max(1,\,T-t) = 1 + \sum_{k=1}^{T-1} 1/k \leq 2 + \ln T. Compared to the optimal solution, it similarly adds a logarithmically growing factor \ln T.
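A small sketch of the recency-weighted accumulator V_t and the causal schedule \eta_t = R/\sqrt{T V_t} (synthetic gradient norms, illustrative names) could look like this:

```python
import numpy as np

T, R = 200, 1.0
G = np.linspace(2.0, 1.0, T)      # synthetic expected gradient norms G_1..G_T

eta = np.empty(T)
for t in range(1, T + 1):
    k = np.arange(1, t + 1)
    V_t = np.sum(G[:t] ** 2 / np.maximum(1, t - k))  # V_t = sum_{k<=t} G_k^2 / max(1, t-k): recent steps dominate
    eta[t - 1] = R / np.sqrt(T * V_t)                # causal schedule η_t = R / sqrt(T V_t)
```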

The Masterpiece

In “Making Alchemy More Scientific (IV): New Identity, New Learning Rate”, we obtained the strongest final loss convergence result to date. Generalizing it to dynamic gradients yields the “masterpiece” combining Equation [leq:avg-weighted] and Equation [leq:last-1]: \begin{equation} \mathbb{E}[L(\boldsymbol{\theta}_T) - L(\boldsymbol{\theta}^*)] \leq \frac{R^2}{2\eta_{1:T}} + \frac{1}{2}\sum_{t=1}^T\frac{\eta_t^2 G_t^2}{\eta_{\min(t+1, T):T}} \label{leq:last-2} \end{equation} This result looks simple but is actually quite complex; at least, we cannot intuitively see what learning rate pattern minimizes the right side. However, in the previous article, I provided a hint: use variational methods after continuous approximation. Let’s try that here: let S_t = \eta_{\min(t+1, T):T}. Then for t < T - 1, we have \eta_t = S_{t-1} - S_t \approx -\dot{S}_t. Approximating \eta_t with -\dot{S}_t and the sum with an integral, the right side of the above equation is approximately (substituting S_t = W_t^2): \begin{equation} \frac{R^2}{2S_0} + \frac{1}{2}\int_0^T \frac{\dot{S}_t^2 G_t^2}{S_t}dt = \frac{R^2}{2W_0^2} + 2\int_0^T \dot{W}_t^2 G_t^2 dt \label{eq:int-approx} \end{equation} By definition W_T=0. Fixing W_0, the integral part becomes a variational problem with fixed boundaries. The Euler-Lagrange equation gives \frac{d}{dt}(\dot{W}_t G_t^2)=0, so \dot{W}_t \propto G_t^{-2}. Integrating both sides and using W_T=0 gives W_t = W_0\int_t^T G_s^{-2} ds/\int_0^T G_s^{-2} ds. Substituting this back into [eq:int-approx] gives R^2/2W_0^2 + 2 W_0^2 / \int_0^T G_s^{-2} ds. The minimum is reached at 2W_0^2 = R(\int_0^T G_s^{-2} ds)^{1/2}. Finally, using the approximation \eta_t \approx -\dot{S}_t = -2W_t\dot{W}_t, we eventually get: \begin{equation} \eta_t \approx \frac{R G_t^{-2} \int_t^T G_s^{-2}ds}{(\int_0^T G_s^{-2}ds)^{3/2}} \end{equation} If we restore discretization, we can guess the optimal learning rate is roughly of the form: \begin{equation} \eta_t = \frac{R G_t^{-2} (Q_T - Q_t)}{Q_T^{3/2}} = \frac{R G_t^{-2}}{\sqrt{Q_T}} (1 - Q_t/Q_T) \label{eq:last-2-opt-lr} \end{equation} where Q_t is the previously defined \sum_{k=1}^t G_k^{-2}. In fact, this “guess” is the correct answer! However, substituting it back into Equation [leq:last-2] for verification is extremely difficult because the denominator in Equation [leq:last-2] is \eta_{t+1:T} rather than \eta_{t:T}, so we cannot guarantee that \eta_t / \eta_{t+1:T} is bounded, making the bounding process particularly hard. I tried for a week without success. We won’t continue trying here but will wait for the next article to prove it through a more ingenious construction.

Now let’s appreciate Equation [eq:last-2-opt-lr]. Note that R G_t^{-2}/\sqrt{Q_T} is exactly the optimal learning rate for [leq:avg-weighted], characterized by being proportional to G_t^{-2}, which we said explains Warmup. Equation [eq:last-2-opt-lr] multiplies this by a term (1 - Q_t/Q_T), which is monotonically decreasing; if G_t is roughly constant, then Q_t grows linearly in t and this factor reduces to the familiar linear decay 1 - t/T. Thus, this term explains Decay. Therefore, Equation [eq:last-2-opt-lr] corresponds exactly to the classic “Warmup-Decay” learning rate strategy.

Even more interestingly, the optimal learning rate for [leq:last-1] in the previous section told us to focus more on the current gradient, while Equation [eq:last-2-opt-lr] is more extreme: it only depends on the current and future gradients, completely ignoring historical gradients. This undoubtedly violates causality. How can we modify it to be causal? We could consider replacing \sqrt{Q_T} with \sqrt{Q_t}. As for Q_t/Q_T, we can write Q_T as (Q_T/T)\times T and then consider replacing Q_T/T with Q_t/t, resulting in: \begin{equation} \eta_t = \frac{R G_t^{-2}}{\sqrt{Q_t}} (1 - t/T) \end{equation} This looks reasonable, but proving it is truly effective by substituting it back into Equation [leq:last-2] is no easy task.
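As a final illustration, this causal “Warmup-Decay” schedule can be computed directly from a stream of gradient norms; the sketch below reuses the same synthetic profile as before (illustrative names only):

```python
import numpy as np

T, R = 1000, 1.0
t = np.arange(1, T + 1)
G = 1.0 + 5.0 * np.exp(-t / 50.0)              # synthetic G_t: large early, roughly constant later

Q = np.cumsum(G ** -2)                         # Q_t = sum_{k<=t} G_k^{-2}
eta = R * G ** -2 / np.sqrt(Q) * (1 - t / T)   # G_t^{-2} factor gives the warmup, (1 - t/T) gives the linear decay
```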

Summary

Starting from this article, we consider gradient-based learning rate scheduling. This helps us understand the principles of strategies like Warmup and Decay and provides useful references for various adaptive learning rate optimizers.

Original Address: https://kexue.fm/archives/11530
