English (unofficial) translations of posts at kexue.fm

Making Alchemy More Scientific (II): Generalizing Conclusions to Unbounded Domains

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

Two years ago, I intended to start a "Scientific Alchemy" series, aiming to systematically organize the classic theoretical results on optimizers. However, after I wrote the first post, "Making Alchemy More Scientific (I): Average Loss Convergence of SGD", the project sat shelved until now. The main reason was that the conditions these classic optimization conclusions rely on felt too stringent and too far removed from practice. In the LLM era especially, their reference value seemed even more limited, so I lacked the motivation to continue writing.

However, while recently reflecting on issues related to Scaling Laws, I discovered that these theoretical results are not as "useless" as I had imagined. They can provide beneficial theoretical insights into some empirical findings. Therefore, I am restarting this series to continue the development of these thematic articles and "repay" the "debts" I owed previously.

Conclusion Review

Regarding notation, we will follow the first article, so I will not repeat the introduction of symbols. The main conclusion of the first article was that under appropriate assumptions, SGD satisfies: \begin{equation} \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t,\boldsymbol{\theta}^*)\leq \frac{R^2}{2T\eta_T} + \frac{G^2}{2T}\sum_{t=1}^T\eta_t \label{leq:avg-1} \end{equation} where R and G are constants independent of the optimization trajectory, and the "appropriate assumptions" include:

1. \boldsymbol{\Theta} is a bounded convex set, where R=\max\limits_{\boldsymbol{\theta}_1,\boldsymbol{\theta}_2\in \boldsymbol{\Theta}}\Vert\boldsymbol{\theta}_1-\boldsymbol{\theta}_2\Vert < \infty;

2. For any \boldsymbol{\theta}\in \boldsymbol{\Theta} and any \boldsymbol{x}, L(\boldsymbol{x},\boldsymbol{\theta}) is a convex function with respect to \boldsymbol{\theta};

3. For any \boldsymbol{\theta}\in \boldsymbol{\Theta} and any \boldsymbol{x}, \Vert\nabla_{\boldsymbol{\theta}}L(\boldsymbol{x},\boldsymbol{\theta})\Vert\leq G < \infty;

4. The learning rate \eta_t is monotonically decreasing in t (more precisely, non-increasing: \eta_t\geq \eta_{t+1}).

The most "awkward" part here is likely the "boundedness" in the first point regarding the "bounded convex set." Since the optimization trajectory of SGD cannot, in principle, be guaranteed to be bounded, we must modify SGD into Projected SGD to ensure boundedness: \begin{equation} \boldsymbol{\theta}_{t+1} = \Pi_{\boldsymbol{\Theta}}\big(\boldsymbol{\theta}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\big)\in \boldsymbol{\Theta} \label{eq:sgd-p} \end{equation} Of course, boundedness is perfectly fine in practice; appropriate boundedness can even enhance the stability of optimization algorithms. However, from the perspective of "theoretical purism," adding an extra constraint is always somewhat uncomfortable. Therefore, in this article, we will first remove the condition of boundedness to obtain a simpler and more enlightening proof.

New Ideas

Incidentally, the proofs in this and the following sections of this article mainly draw on the post "Last Iterate of SGD Converges (Even in Unbounded Domains)" from the blog Parameter-free. This is a classic blog on optimization theory; beyond this post, it contains many theoretical introductions to optimizers, much of it the author's original work, and it is well worth reading.

As with the proof in the previous article, our starting point is the identity: \begin{equation} \begin{aligned} \Vert\boldsymbol{\theta}_{t+1} - \boldsymbol{\varphi}\Vert^2=&\, \Vert\boldsymbol{\theta}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)- \boldsymbol{\varphi}\Vert^2 \\ =&\, \Vert\boldsymbol{\theta}_t - \boldsymbol{\varphi}\Vert^2 - 2\eta_t (\boldsymbol{\theta}_t- \boldsymbol{\varphi})\cdot\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) + \eta_t^2\Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \end{aligned} \end{equation} Here \boldsymbol{\varphi} is an arbitrary vector. In the previous article, we directly set \boldsymbol{\varphi}=\boldsymbol{\theta}^*, but here we keep it general. From this identity, the previous article divided both sides by 2\eta_t and then summed over t, and it was this step that made the boundedness condition necessary for the subsequent relaxations. In this article, we do not divide by 2\eta_t; instead, we sum both sides directly over t, so that the intermediate \Vert\boldsymbol{\theta}_t - \boldsymbol{\varphi}\Vert^2 terms telescope away: \begin{equation} \begin{aligned} 2\sum_{t=1}^T \eta_t (\boldsymbol{\theta}_t- \boldsymbol{\varphi})\cdot\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) =&\, \Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2 - \Vert\boldsymbol{\theta}_{T+1} - \boldsymbol{\varphi}\Vert^2 + \sum_{t=1}^T \eta_t^2 \Vert\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t)\Vert^2 \\ \leq&\, \Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2 + G^2\sum_{t=1}^T \eta_t^2 \end{aligned} \end{equation} By the convexity of L, we have (\boldsymbol{\theta}_t- \boldsymbol{\varphi})\cdot\boldsymbol{g}(\boldsymbol{x}_t,\boldsymbol{\theta}_t) \geq L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi}). Substituting this into the left side of the above inequality and rearranging gives: \begin{equation} \sum_{t=1}^T \eta_t [L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})]\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2}{2} + \frac{G^2}{2}\sum_{t=1}^T \eta_t^2 \label{leq:avg-2-mid1} \end{equation} This is already very close to the result we want, and no boundedness condition was used anywhere in the relaxation. All that remains is an assumption on the distance between the initialization \boldsymbol{\theta}_1 and the comparison point \boldsymbol{\varphi}.
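As a sanity check (a toy example of my own, not from the original post), the following Python sketch verifies the pathwise inequality [leq:avg-2-mid1] on a simple convex problem, with L(\boldsymbol{x},\boldsymbol{\theta}) = \tfrac{1}{2}\Vert\boldsymbol{\theta}-\boldsymbol{x}\Vert^2 and G replaced by the largest gradient norm observed along the trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, T, alpha = 5, 1000, 0.1

def loss(x, theta):   # toy convex per-sample loss L(x, theta) = 1/2 ||theta - x||^2
    return 0.5 * np.sum((theta - x) ** 2)

def grad(x, theta):   # its gradient with respect to theta
    return theta - x

theta1 = rng.normal(size=dim)          # initialization theta_1
phi = np.zeros(dim)                    # an arbitrary comparison point varphi
theta = theta1.copy()
lhs, sq_lr_sum, max_grad_norm = 0.0, 0.0, 0.0
for t in range(1, T + 1):
    x = rng.normal(size=dim)           # sample x_t
    lr = alpha / np.sqrt(t)            # eta_t = alpha / sqrt(t)
    g = grad(x, theta)
    lhs += lr * (loss(x, theta) - loss(x, phi))
    sq_lr_sum += lr ** 2
    max_grad_norm = max(max_grad_norm, np.linalg.norm(g))
    theta = theta - lr * g             # plain SGD update, no projection

rhs = 0.5 * np.linalg.norm(theta1 - phi) ** 2 + 0.5 * max_grad_norm ** 2 * sq_lr_sum
print(f"LHS = {lhs:.3f} <= RHS = {rhs:.3f}")   # the inequality holds for every single run
```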

Adding Expectation

Next, if we could further assume the non-negativity of L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi}), we could combine it with the monotonic decrease of \eta_t to get \eta_t [L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})] \geq \eta_T [L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})], and then substitute this into the above equation.

The problem is that even if we set \boldsymbol{\varphi} to be the global optimum \boldsymbol{\theta}^*, we cannot guarantee that L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi}) is non-negative, because the optimum is optimal for the entire dataset, not necessarily for each individual sample. To sidestep this difficulty, note that the right side of Eq. [leq:avg-2-mid1] does not depend on the data, while the left side does. Taking the expectation of both sides therefore preserves the inequality: \begin{equation} \sum_{t=1}^T \eta_t \mathbb{E}[L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi})]\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2}{2} + \frac{G^2}{2}\sum_{t=1}^T \eta_t^2 \label{leq:avg-2-mid2} \end{equation} Here \mathbb{E} is \mathbb{E}_{\boldsymbol{x}_1,\boldsymbol{x}_2,\cdots,\boldsymbol{x}_T}, which can be read as "repeating SGD infinitely many times with different sampling orders and averaging over the resulting optimization trajectories." Then, since \boldsymbol{\varphi} is an arbitrary vector independent of the data, we have \mathbb{E}[L(\boldsymbol{x}_t,\boldsymbol{\varphi})] = \mathbb{E}_{\boldsymbol{x}_t}[L(\boldsymbol{x}_t,\boldsymbol{\varphi})] = L(\boldsymbol{\varphi}), which is exactly our optimization objective. As for L(\boldsymbol{x}_t,\boldsymbol{\theta}_t), note that \boldsymbol{\theta}_t depends only on \boldsymbol{x}_1,\cdots,\boldsymbol{x}_{t-1}, so we have: \begin{equation} \mathbb{E}[L(\boldsymbol{x}_t,\boldsymbol{\theta}_t)] = \mathbb{E}_{\boldsymbol{x}_1,\cdots,\boldsymbol{x}_t}[L(\boldsymbol{x}_t,\boldsymbol{\theta}_t)] = \mathbb{E}_{\boldsymbol{x}_1,\cdots,\boldsymbol{x}_{t-1}}[\mathbb{E}_{\boldsymbol{x}_t}[L(\boldsymbol{x}_t,\boldsymbol{\theta}_t)]] = \mathbb{E}_{\boldsymbol{x}_1,\cdots,\boldsymbol{x}_{t-1}}[L(\boldsymbol{\theta}_t)] = \mathbb{E}[L(\boldsymbol{\theta}_t)] \end{equation} Since \boldsymbol{\varphi} is independent of the data, L(\boldsymbol{\varphi}) can also be written as \mathbb{E}[L(\boldsymbol{\varphi})]. Substituting these into Eq. [leq:avg-2-mid2] yields: \begin{equation} \sum_{t=1}^T \eta_t \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\varphi})]\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2}{2} + \frac{G^2}{2}\sum_{t=1}^T \eta_t^2 \label{leq:avg-2-mid3} \end{equation}
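To make the meaning of this expectation concrete, here is a toy Monte Carlo illustration of my own (not from the original post), assuming Gaussian data \boldsymbol{x}\sim\mathcal{N}(0, I) and the quadratic loss from the earlier sketch, so that the population loss has the closed form L(\boldsymbol{\theta}) = \tfrac{1}{2}\Vert\boldsymbol{\theta}\Vert^2 + \tfrac{d}{2}. Averaging over many independent runs of SGD, the per-sample losses L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) and the population losses L(\boldsymbol{\theta}_t) should agree in expectation:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, T, alpha, runs = 5, 200, 0.1, 2000

def pop_loss(theta):
    # For x ~ N(0, I): L(theta) = E_x[ 1/2 ||theta - x||^2 ] = 1/2 ||theta||^2 + dim/2
    return 0.5 * np.sum(theta ** 2) + 0.5 * dim

sum_per_sample = np.zeros(T)   # accumulates L(x_t, theta_t) over runs
sum_population = np.zeros(T)   # accumulates L(theta_t) over runs
for _ in range(runs):          # "repeat SGD with different sampling orders"
    theta = np.ones(dim)       # the same theta_1 for every run
    for t in range(T):
        x = rng.normal(size=dim)
        lr = alpha / np.sqrt(t + 1)
        sum_per_sample[t] += 0.5 * np.sum((theta - x) ** 2)   # L(x_t, theta_t)
        sum_population[t] += pop_loss(theta)                  # L(theta_t)
        theta -= lr * (theta - x)                             # SGD step

# Up to Monte Carlo noise, the two averaged curves coincide,
# illustrating E[L(x_t, theta_t)] = E[L(theta_t)].
print(np.max(np.abs(sum_per_sample - sum_population)) / runs)
```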

Some Reflections

Before proceeding further, I would like to discuss two questions: why can we introduce this expectation, and how should we understand a conclusion stated in expectation?

The answer to "why" is simple: to allow the derivation to proceed. For example, we want the non-negativity of L(\boldsymbol{x}_t,\boldsymbol{\theta}_t) - L(\boldsymbol{x}_t,\boldsymbol{\varphi}), but this cannot be guaranteed. By adding the expectation, we can obtain the non-negativity of L(\boldsymbol{\theta}_t) - L(\boldsymbol{\varphi}), provided that \boldsymbol{\varphi} is the theoretical optimum. In general, taking the expectation eliminates the dependence on the data variable \boldsymbol{x}_t and simplifies subsequent derivations, which is the primary purpose of this step.

As for "how," some readers might object that in practice we never actually repeat training many times and average, so why should we care about a result in expectation? To answer this, we should ask what we essentially want. In fact, we only want a conclusion that SGD converges in some sense; and since the objective function has already been restricted to convex functions, we would like that conclusion to be mathematically rigorous.

For a complex stochastic process, the mathematical expectation is often the most basic quantity we can compute, and sometimes the only one. Taking expectations on both sides does move the conclusion slightly away from practice, but it still gives a meaningful convergence guarantee. Practice is, as always, more complicated than theory; as long as a result retains some connection to practice, can describe some phenomena, and offers some inspiration, it is a valuable theoretical result.

Monotonic Relaxation

Returning to the main thread: with Eq. [leq:avg-2-mid3] in hand, if we set \boldsymbol{\varphi} to be the theoretical optimum \boldsymbol{\theta}^*, we obtain the non-negativity of L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*). Then, using the monotonic decrease of \eta_t, we can replace each \eta_t on the left by \eta_T (which only decreases the left side, since each term is non-negative), and finally divide both sides by T\eta_T to get: \begin{equation} \frac{1}{T}\sum_{t=1}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)] \leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2T\eta_T} + \frac{G^2}{2T}\sum_{t=1}^T \frac{\eta_t^2}{\eta_T} \label{leq:avg-2} \end{equation} Comparing Eq. [leq:avg-2] with Eq. [leq:avg-1], the dependence on T and \eta_t is similar: the R^2 in the first term on the right is replaced by \Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2, which is relatively smaller; the \eta_t in the summation of the second term is replaced by \eta_t^2/\eta_T, which is usually larger. Between one smaller and one larger, is the difference significant? Let's look at two examples. The first is a constant learning rate \frac{\alpha}{\sqrt{T}}. In this case, the conclusions of [leq:avg-1] and [leq:avg-2] are essentially the same. Taking [leq:avg-2] as an example, substituting gives: \begin{equation} \frac{1}{T}\sum_{t=1}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)] \leq \frac{1}{2\sqrt{T}}\left(\frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{\alpha} + G^2\alpha\right) \end{equation} Both achieve a convergence rate of \mathcal{O}(1/\sqrt{T}). The second example is a dynamic learning rate \eta_t = \frac{\alpha}{\sqrt{t}}. As proved in the previous article, substituting this into [leq:avg-1] also yields a convergence rate of \mathcal{O}(1/\sqrt{T}). However, if substituted into [leq:avg-2], the result is: \begin{equation} \frac{1}{T}\sum_{t=1}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)] \leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2 \alpha \sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}}\sum_{t=1}^T \frac{1}{t} \sim \mathcal{O}\left(\frac{\ln T}{\sqrt{T}}\right) \end{equation} Therefore, the cost of generalizing to an unbounded domain is that the convergence rate degrades slightly, from \mathcal{O}(1/\sqrt{T}) to \mathcal{O}(\ln T/\sqrt{T}). This degree of slowdown makes essentially no difference in practice. Moreover, this result is specific to \eta_t = \frac{\alpha}{\sqrt{t}}; by choosing \eta_t = \frac{\alpha}{\sqrt{t\ln t}} instead, one can get closer to \mathcal{O}(\sqrt{\ln T/T}), but this is essentially a derivation game with no fundamental difference.
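The extra \ln T factor can also be seen numerically by evaluating the right-hand side of Eq. [leq:avg-2] for the two schedules. The following sketch is my own, with \Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert = G = 1 and \alpha = 0.1 chosen arbitrarily:

```python
import numpy as np

def rhs_bound(etas, dist_sq, G):
    """Right-hand side of Eq. [leq:avg-2]: dist^2/(2 T eta_T) + G^2/(2T) * sum(eta_t^2) / eta_T."""
    T, eta_T = len(etas), etas[-1]
    return dist_sq / (2 * T * eta_T) + G ** 2 / (2 * T) * np.sum(etas ** 2) / eta_T

dist_sq, G, alpha = 1.0, 1.0, 0.1
for T in [10**3, 10**4, 10**5]:
    t = np.arange(1, T + 1)
    const = np.full(T, alpha / np.sqrt(T))   # constant schedule eta_t = alpha / sqrt(T)
    decay = alpha / np.sqrt(t)               # dynamic schedule eta_t = alpha / sqrt(t)
    print(T, rhs_bound(const, dist_sq, G), rhs_bound(decay, dist_sq, G))

# The constant schedule decays like 1/sqrt(T); the 1/sqrt(t) schedule
# carries an extra ln(T) factor, i.e. O(ln T / sqrt(T)).
```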

Weighted Average

In fact, there is an even simpler way to handle Eq. [leq:avg-2-mid3]: directly divide both sides by \sum_{t=1}^T \eta_t, and then substitute \boldsymbol{\varphi}=\boldsymbol{\theta}^*: \begin{equation} \frac{\sum_{t=1}^T \eta_t \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)]}{\sum_{t=1}^T \eta_t}\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2\sum_{t=1}^T \eta_t} + \frac{G^2}{2}\frac{\sum_{t=1}^T \eta_t^2}{\sum_{t=1}^T \eta_t} \label{leq:avg-3} \end{equation} This requires no monotonic-decrease assumption on the learning rate, so it is a convergence conclusion obtained under quite lenient conditions. It describes the convergence of the weighted average of the losses along the optimization trajectory. Some readers might ask why a weighted average. The answer is the same as before: "we only want a conclusion that SGD converges in some sense", so from the perspective of approaching the optimal loss value, what does it matter whether the average is weighted or not?

To some extent, Eq. [leq:avg-3] provides more inspiration than Eq. [leq:avg-2]. For example, to make the right side converge as quickly as possible, the sum of learning rates \sum_{t=1}^T \eta_t should be as large as possible, while the sum of squares \sum_{t=1}^T \eta_t^2 should be as small as possible. If we want to converge to the optimum as T\to\infty, the learning rate should satisfy: \begin{equation} \sum_{t=1}^{\infty} \eta_t = \infty \qquad\text{and}\qquad \sum_{t=1}^{\infty} \eta_t^2 \bigg/ \sum_{t=1}^{\infty} \eta_t = 0 \end{equation} This yields the two classic conditions for learning rate strategies. Furthermore, Eq. [leq:avg-3] does not explicitly depend on the final learning rate \eta_T, thus allowing \eta_T\to 0, which gives us more flexibility in scheduling the learning rate. In short, Eq. [leq:avg-3] gives a more "composed" feeling and is aesthetically more pleasing.
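These two conditions, and the schedules they rule in or out, can be probed numerically by evaluating the two terms on the right of Eq. [leq:avg-3]. A small sketch of my own, with \Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert = G = 1 and the schedule constants chosen arbitrarily:

```python
import numpy as np

def avg3_rhs_terms(etas, dist_sq=1.0, G=1.0):
    """The two terms on the right of Eq. [leq:avg-3] for a finite schedule eta_1..eta_T."""
    s1, s2 = np.sum(etas), np.sum(etas ** 2)
    return dist_sq / (2 * s1), G ** 2 * s2 / (2 * s1)

T = 10**6
t = np.arange(1, T + 1)
schedules = {
    "eta_t = 0.1/sqrt(t)": 0.1 / np.sqrt(t),  # satisfies both conditions: both terms vanish
    "eta_t = 0.1/t":       0.1 / t,           # sum(eta_t) grows only like ln T: first term vanishes very slowly
    "eta_t = 0.1":         np.full(T, 0.1),   # sum(eta_t^2)/sum(eta_t) stays constant: second term does not vanish
}
for name, etas in schedules.items():
    first, second = avg3_rhs_terms(etas)
    print(f"{name:22s}  first term = {first:.2e}  second term = {second:.2e}")
```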

Summary

In this article, we restarted the "Scientific Alchemy" series, generalizing the conclusion of SGD convergence in bounded domains from the previous article to unbounded domains, obtaining richer results.

Reprinting: Please include the original address of this article: https://kexue.fm/archives/11469

For more detailed reprinting matters, please refer to: "Scientific Space FAQ"