Often, we jokingly refer to the training process of deep learning models as “alchemy”. This is because, much like ancient alchemy, the process seems to have some scientific basis but overall feels mysterious and esoteric. Although this site has previously followed some work related to optimizers and even written the “Optimization Algorithms from a Dynamical Perspective” series, those were relatively superficial introductions that did not delve into deeper theory. To make future alchemy more scientific, I have decided to study some theoretical results related to optimization, striving to provide more theoretical support for the path of alchemy.
In this article, we will learn a very fundamental convergence result for Stochastic Gradient Descent (SGD). Although this conclusion may seem coarse and impractical by modern standards, it was a very important early attempt at proving the convergence of optimizers. In particular, it accounts for the fact that in practice we use stochastic gradients rather than full-batch gradient descent (GD), which makes the conclusion a more meaningful reference.
Problem Setting
Let the loss function be L(\boldsymbol{x}, \boldsymbol{\theta}), where \boldsymbol{x} is the training set and \boldsymbol{\theta} \in \mathbb{R}^d represents the training parameters. Limited by computational power, we can usually only perform Stochastic Gradient Descent (SGD). That is, at each step, we can only sample a subset of the training data to calculate the loss function and update the parameters. Assuming the sampling is independent and identically distributed (i.i.d.), and the subset sampled at step t is \boldsymbol{x}_t, we can reasonably assume that the ultimate goal of the actual optimization is: \begin{equation} L(\boldsymbol{\theta}) = \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}) \label{eq:loss} \end{equation} In practice, we can only train for a finite number of steps, so we assume T is a sufficiently large positive integer constant. Our goal is to find the minimum point of L(\boldsymbol{\theta}), i.e., we hope to find \boldsymbol{\theta}^*: \begin{equation} \boldsymbol{\theta}^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}\in\mathbb{R}^d} L(\boldsymbol{\theta}) \label{eq:argmin} \end{equation} Now, we consider the following SGD iteration: \begin{equation} \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t) \label{eq:sgd} \end{equation} where \eta_t > 0 is the learning rate, and \boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}) \triangleq \nabla_{\boldsymbol{\theta}}L(\boldsymbol{x}_t, \boldsymbol{\theta}) is the gradient of L(\boldsymbol{x}_t, \boldsymbol{\theta}) with respect to \boldsymbol{\theta}. Our task is to analyze whether \boldsymbol{\theta}_t can converge to the target point \boldsymbol{\theta}^* as the iteration continues.
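To make the iteration concrete, here is a minimal Python sketch of iteration [eq:sgd] on a toy problem. Everything here (the least-squares loss, the names `sample_batch` and `grad`, the batch size) is illustrative, not part of the analysis below:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_true = rng.normal(size=d)      # ground truth for the toy problem

def sample_batch(n=32):
    """Draw an i.i.d. minibatch x_t = (A, b) for a least-squares loss."""
    A = rng.normal(size=(n, d))
    b = A @ theta_true + 0.1 * rng.normal(size=n)
    return A, b

def grad(batch, theta):
    """g(x_t, theta) = gradient of (1/2n)||A theta - b||^2 w.r.t. theta."""
    A, b = batch
    return A.T @ (A @ theta - b) / len(b)

theta = np.zeros(d)
for t in range(1, 2001):
    eta_t = 1.0 / np.sqrt(t)          # a decaying learning rate
    theta = theta - eta_t * grad(sample_batch(), theta)
```

With the decaying learning rate \eta_t = 1/\sqrt{t} (a schedule we will meet again below), the iterates drift toward the minimizer despite the gradient noise.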
Preliminary Conclusion
First, we present the final inequality to be proven: under appropriate assumptions, we have \begin{equation} \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}_t) - \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}^*) \leq \frac{D^2}{2T\eta_T} + \frac{G^2}{2T}\sum_{t=1}^T\eta_t \label{neq:core} \end{equation} where D and G are constants independent of the optimization process. Later, we will introduce what the “appropriate assumptions” specifically are. Before that, let’s observe the specific meaning expressed by inequality [neq:core]:
1. The first term on the left side is the average result of the loss function at each step during the optimization process;
2. The second term on the left side, according to equation [eq:loss], can be considered as L(\boldsymbol{\theta}^*) when T is large enough;
3. Combining the left side, it represents the difference between the average loss during optimization and the theoretical minimum of the loss function;
4. The right side is an expression that depends only on the learning rate strategy \{\eta_t\}.
Putting points 1, 2, 3, and 4 together, inequality [neq:core] states: under appropriate assumptions, the gap between the average loss of SGD and the ideal target we are seeking is controlled by an expression that depends only on the learning rate strategy. If we can choose a learning rate schedule that makes this expression tend to zero, then the average loss of SGD is guaranteed to converge to the theoretical optimum. (Strictly speaking, this conclusion only guarantees reaching the minimum value L(\boldsymbol{\theta}^*) of the loss function; it does not guarantee finding the specific minimum point \boldsymbol{\theta}^*.)
Simply put, this is a theoretical result regarding the conditions under which SGD will converge. By the way, the left side of inequality [neq:core], multiplied by T (i.e., the cumulative rather than average gap), is what the online learning literature calls the “regret.”
Two Examples
For example, assuming the learning rate is a constant \eta, the right side of inequality [neq:core] becomes: \begin{equation} \frac{D^2}{2T\eta} + \frac{G^2}{2T}\sum_{t=1}^T\eta = \frac{D^2}{2T\eta} + \frac{G^2\eta}{2} \geq \frac{DG}{\sqrt{T}} \end{equation} The equality holds when \eta = \frac{D}{G\sqrt{T}}. That is, if the learning rate is taken as the constant \frac{D}{G\sqrt{T}}, then we have: \begin{equation} \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}_t) - \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}^*) \leq \frac{DG}{\sqrt{T}} \label{neq:case-1} \end{equation} As T \to \infty, the right side tends to zero. This means that when the number of training steps T is large enough, setting the learning rate to the constant \frac{D}{G\sqrt{T}} can make the gap between the SGD iteration average and the theoretical optimum arbitrarily small.
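As a quick numerical sanity check of the AM-GM step above (with illustrative values of D, G, and T), the constant-rate bound is indeed minimized at \eta = D/(G\sqrt{T}), where it equals DG/\sqrt{T}:

```python
import numpy as np

# illustrative constants; any positive values behave the same way
D, G, T = 2.0, 5.0, 10_000

def bound(eta):
    """Right side of [neq:core] for a constant learning rate eta."""
    return D**2 / (2 * T * eta) + G**2 * eta / 2

eta_star = D / (G * np.sqrt(T))   # the minimizing learning rate
```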
Another example is the decay strategy \eta_t = \frac{\alpha}{\sqrt{t}}. Using the bound: \begin{equation} \sum_{t=1}^T \frac{1}{\sqrt{t}} = 1 + \sum_{t=2}^T \frac{1}{\sqrt{t}} \leq 1 + \sum_{t=2}^T \frac{2}{\sqrt{t-1} + \sqrt{t}} = 1 + \sum_{t=2}^T 2(\sqrt{t}-\sqrt{t-1}) = 2\sqrt{T}-1 < 2\sqrt{T} \end{equation} Substituting this into inequality [neq:core] gives: \begin{equation} \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}_t) - \frac{1}{T}\sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}^*) < \frac{D^2}{2\alpha\sqrt{T}} + \frac{G^2\alpha}{\sqrt{T}} \label{neq:case-2} \end{equation} Both inequality [neq:case-2] and inequality [neq:case-1] are \mathcal{O}\left(\frac{1}{\sqrt{T}}\right) in T, so both converge in theory. Compared with inequality [neq:case-1], the constant in inequality [neq:case-2] is larger, which suggests that the constant schedule \eta_t \equiv \frac{D}{G\sqrt{T}} likely converges faster than \eta_t = \frac{\alpha}{\sqrt{t}}. In practice, however, we prefer the latter: the former requires fixing the total number of training steps T in advance, so training must stop at that point with a fixed precision, whereas the latter has no such restriction. One does not even need to tune \alpha; simply taking \eta_t = \frac{1}{\sqrt{t}} allows training to continue indefinitely, and in theory the gap between the average loss and the theoretical minimum keeps shrinking.
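The partial-sum bound used above is easy to verify numerically (the value 10^5 for T is arbitrary):

```python
import numpy as np

T = 100_000
s = sum(1.0 / np.sqrt(t) for t in range(1, T + 1))
upper = 2 * np.sqrt(T) - 1        # the telescoping bound from the text
```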
Even so, a learning rate strategy like \eta_t = \frac{1}{\sqrt{t}} is still far from what we usually use in training, whether in terms of magnitude or variation patterns. Therefore, it is not hard to guess that many strong assumptions must have been added. Without further ado, let’s start the proof process and unfold the assumptions one by one.
Proof Process
At the beginning of the proof, we assume that for any \boldsymbol{x}, L(\boldsymbol{x}, \boldsymbol{\theta}) is a convex function with respect to \boldsymbol{\theta}. This is a very strong assumption that usually does not align with training facts, but there is no choice; theoretical analysis usually can only be done under strong assumptions, and then these conclusions are applied heuristically to actual scenarios.
Convex functions have many different definitions; here we directly adopt the following: \begin{equation} L(\boldsymbol{x}_t, \boldsymbol{\theta}_2) - L(\boldsymbol{x}_t, \boldsymbol{\theta}_1) \geq (\boldsymbol{\theta}_2-\boldsymbol{\theta}_1)\cdot\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_1),\quad \forall \boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \label{eq:convex} \end{equation} where \boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_1) \triangleq \nabla_{\boldsymbol{\theta}}L(\boldsymbol{x}_t, \boldsymbol{\theta}_1) is the gradient of L(\boldsymbol{x}_t, \boldsymbol{\theta}) with respect to \boldsymbol{\theta}, and \cdot denotes the vector dot product. The geometric meaning of the above definition is that the graph of a convex function always lies above its tangent line (or plane).
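For a quick check that definition [eq:convex] behaves as expected, here is a small numerical experiment with a quadratic loss L(\boldsymbol{\theta}) = \frac{1}{2}\boldsymbol{\theta}^{\top}H\boldsymbol{\theta} for a positive-definite H (a convex function); the setup is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8))
H = A @ A.T + np.eye(8)          # positive definite, so L below is convex

def L(theta):
    """L(theta) = 1/2 theta^T H theta (the data x is held fixed here)."""
    return 0.5 * theta @ H @ theta

def g(theta):
    """Gradient of L: H theta."""
    return H @ theta

# worst violation of L(th2) - L(th1) >= (th2 - th1)·g(th1) over random pairs
gap = min(
    L(th2) - L(th1) - (th2 - th1) @ g(th1)
    for th1, th2 in (rng.normal(size=(2, 8)) for _ in range(100))
)
```

For this quadratic, the slack is exactly \frac{1}{2}(\boldsymbol{\theta}_2-\boldsymbol{\theta}_1)^{\top}H(\boldsymbol{\theta}_2-\boldsymbol{\theta}_1) \geq 0, so the inequality holds with room to spare.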
The key to the proof is to consider the distance between \boldsymbol{\theta}_{t+1} and \boldsymbol{\theta}^*: \begin{equation} \begin{aligned} \Vert\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}^*\Vert^2 =&\, \Vert\boldsymbol{\theta}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t) - \boldsymbol{\theta}^*\Vert^2 \\ =&\, \Vert\boldsymbol{\theta}_t - \boldsymbol{\theta}^*\Vert^2 - 2\eta_t (\boldsymbol{\theta}_t - \boldsymbol{\theta}^*)\cdot\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t) + \eta_t^2\Vert\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t)\Vert^2 \end{aligned} \end{equation} Rewriting this as: \begin{equation} (\boldsymbol{\theta}_t - \boldsymbol{\theta}^*)\cdot\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t) = \frac{\Vert\boldsymbol{\theta}_t - \boldsymbol{\theta}^*\Vert^2}{2\eta_t} - \frac{\Vert\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}^*\Vert^2}{2\eta_t} + \frac{1}{2}\eta_t\Vert\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t)\Vert^2 \end{equation} According to equation [eq:convex], we have L(\boldsymbol{x}_t, \boldsymbol{\theta}_t) - L(\boldsymbol{x}_t, \boldsymbol{\theta}^*) \leq (\boldsymbol{\theta}_t - \boldsymbol{\theta}^*)\cdot\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t). 
Substituting this into the above equation gives: \begin{equation} L(\boldsymbol{x}_t, \boldsymbol{\theta}_t) - L(\boldsymbol{x}_t, \boldsymbol{\theta}^*) \leq \frac{\Vert\boldsymbol{\theta}_t - \boldsymbol{\theta}^*\Vert^2}{2\eta_t} - \frac{\Vert\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}^*\Vert^2}{2\eta_t} + \frac{1}{2}\eta_t\Vert\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t)\Vert^2 \end{equation} Summing both sides for t=1, 2, \dots, T: \begin{equation} \sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}_t) - \sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}^*) \leq \sum_{t=1}^T \left(\frac{\Vert\boldsymbol{\theta}_t - \boldsymbol{\theta}^*\Vert^2}{2\eta_t} - \frac{\Vert\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}^*\Vert^2}{2\eta_t}\right) + \sum_{t=1}^T \frac{1}{2}\eta_t\Vert\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t)\Vert^2 \label{neq:base} \end{equation} Now we introduce a new assumption: \eta_t is a monotonically decreasing function of t (i.e., \eta_t \geq \eta_{t+1}). Let D = \max_t \Vert\boldsymbol{\theta}_t - \boldsymbol{\theta}^*\Vert. Then we have: \begin{equation} \begin{aligned} & \sum_{t=1}^T \left(\frac{\Vert\boldsymbol{\theta}_t - \boldsymbol{\theta}^*\Vert^2}{2\eta_t} - \frac{\Vert\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}^*\Vert^2}{2\eta_t}\right) \\ =& \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2\eta_1} - \frac{\Vert\boldsymbol{\theta}_{T+1} - \boldsymbol{\theta}^*\Vert^2}{2\eta_T} + \sum_{t=2}^T \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}\right)\Vert\boldsymbol{\theta}_t - \boldsymbol{\theta}^*\Vert^2 \\ \leq& \frac{D^2}{2\eta_1} + \sum_{t=2}^T \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}\right)D^2 \\ =& \frac{D^2}{2\eta_T} \end{aligned} \end{equation} Finally, let G = \max_t \Vert\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t)\Vert. 
Substituting this into equation [neq:base] yields: \begin{equation} \sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}_t) - \sum_{t=1}^T L(\boldsymbol{x}_t, \boldsymbol{\theta}^*) \leq \frac{D^2}{2\eta_T} + \frac{G^2}{2}\sum_{t=1}^T \eta_t \end{equation} Dividing both sides by T gives inequality [neq:core].
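To see the bound in action, here is a small simulation (entirely illustrative: a quadratic per-sample loss, Gaussian data, and D, G measured empirically along the trajectory) checking that inequality [neq:core] holds for an actual SGD run:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 3, 5000

# per-sample convex loss L(x_t, theta) = 1/2 ||theta - x_t||^2, gradient theta - x_t
x = rng.normal(size=(T, d)) + 1.0

theta = np.zeros(d)
thetas = np.empty((T, d))
etas = 1.0 / np.sqrt(np.arange(1, T + 1))     # decreasing learning rates
for t in range(T):
    thetas[t] = theta
    theta = theta - etas[t] * (theta - x[t])

theta_best = x.mean(axis=0)                   # minimizer of the average loss
avg = lambda th: np.mean(0.5 * np.sum((th - x) ** 2, axis=1))
regret = avg(thetas) - avg(theta_best)        # left side of [neq:core]

# empirical constants, as in the derivation
D = np.max(np.linalg.norm(thetas - theta_best, axis=1))
G = np.max(np.linalg.norm(thetas - x, axis=1))
rhs = D**2 / (2 * T * etas[-1]) + G**2 / (2 * T) * etas.sum()
```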
Note that the constants D and G are currently optimization-dependent, meaning one must first determine the learning rate strategy \{\eta_t\} and complete the optimization process to obtain D and G. To make D and G constants independent of the optimization, we need to assume that for any \{\eta_t\}, \Vert\boldsymbol{\theta}_t - \boldsymbol{\theta}^*\Vert and \Vert\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t)\Vert do not exceed some constants D and G, respectively.
Domain Projection
However, the assumption “for any \{\eta_t\}, \Vert\boldsymbol{\theta}_t - \boldsymbol{\theta}^*\Vert and \Vert\boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t)\Vert do not exceed some constants D and G” feels a bit like “putting the cart before the horse”: it seems we need to finish the optimization to determine D and G, while our goal is to use the proven result to improve the optimization. To remove this strange feeling, we simply change these two assumptions to:
1. The \boldsymbol{\theta} \in \mathbb{R}^d in equation [eq:argmin] is changed to \boldsymbol{\theta} \in \boldsymbol{\Theta}, where \boldsymbol{\Theta} \subseteq \mathbb{R}^d is a bounded convex set;
Bounded: D = \max_{\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \boldsymbol{\Theta}} \Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2\Vert < \infty;
Convex Set: \forall \boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \boldsymbol{\Theta} and \forall \lambda \in [0, 1], \lambda \boldsymbol{\theta}_1 + (1-\lambda)\boldsymbol{\theta}_2 \in \boldsymbol{\Theta}.
2. For any \boldsymbol{\theta} \in \boldsymbol{\Theta} and any \boldsymbol{x}, \Vert\boldsymbol{g}(\boldsymbol{x}, \boldsymbol{\theta})\Vert \leq G.
The second point is easier to accept; it’s just another assumption on the loss function L(\boldsymbol{x}, \boldsymbol{\theta}). Since we’ve already assumed convexity, adding a few more constraints doesn’t hurt. But the first point seems less intuitive: convexity is acceptable because convex functions are defined on convex sets, but how do we ensure boundedness? That is, how do we ensure the output of iteration [eq:sgd] remains bounded?
The answer is to “add a projection step.” We define the projection operation: \begin{equation} \Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi}) = \mathop{\text{argmin}}_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\Vert\boldsymbol{\varphi}-\boldsymbol{\theta}\Vert \end{equation} which finds the vector in \boldsymbol{\Theta} closest to \boldsymbol{\varphi}. Thus, we can modify equation [eq:sgd] to: \begin{equation} \boldsymbol{\theta}_{t+1} = \Pi_{\boldsymbol{\Theta}}\big(\boldsymbol{\theta}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t)\big) \in \boldsymbol{\Theta} \label{eq:sgd-p} \end{equation} This ensures that the iteration results always stay within the set \boldsymbol{\Theta}.
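As a concrete illustration, when \boldsymbol{\Theta} is a Euclidean ball the projection \Pi_{\boldsymbol{\Theta}} has a closed form: points outside the ball are rescaled onto its surface. Here is a sketch of the projected iteration [eq:sgd-p] with this choice (the radius, dimension, and stand-in gradient are all arbitrary):

```python
import numpy as np

def project_ball(phi, radius=10.0):
    """Pi_Theta for the ball Theta = {theta : ||theta|| <= radius}."""
    norm = np.linalg.norm(phi)
    return phi if norm <= radius else phi * (radius / norm)

rng = np.random.default_rng(2)
theta = np.zeros(3)
for t in range(1, 501):
    g_t = theta + rng.normal(size=3)   # stand-in stochastic gradient of a convex loss
    theta = project_ball(theta - g_t / np.sqrt(t))
```

By construction, every iterate stays inside the ball, no matter how large a step the gradient suggests.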
Does the previous proof and conclusion (specifically inequality [neq:core]) still hold after this modification? Fortunately, yes. We only need to prove that for the projected SGD defined by equation [eq:sgd-p], we have: \begin{equation} \Vert\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}^*\Vert \leq \Vert\boldsymbol{\theta}_t - \eta_t \boldsymbol{g}(\boldsymbol{x}_t, \boldsymbol{\theta}_t) - \boldsymbol{\theta}^*\Vert \end{equation} Then the derivation in the “Proof Process” section can still proceed, with some equal signs becoming \leq. To this end, we only need to prove that for any \boldsymbol{\varphi} \in \mathbb{R}^d and \boldsymbol{\theta} \in \boldsymbol{\Theta}: \begin{equation} \Vert\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi}) - \boldsymbol{\theta}\Vert \leq \Vert \boldsymbol{\varphi} - \boldsymbol{\theta}\Vert \end{equation}
Proof. The key is to combine the definition of a convex set with the definition of \Pi_{\boldsymbol{\Theta}}. First, by the definition of a convex set, we know that for any \lambda \in (0, 1), \lambda\boldsymbol{\theta} + (1-\lambda)\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi}) \in \boldsymbol{\Theta}. Then, by the definition of \Pi_{\boldsymbol{\Theta}}, it always holds that: \begin{equation} \Vert\boldsymbol{\varphi} - \Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})\Vert \leq \Vert\boldsymbol{\varphi} - \lambda\boldsymbol{\theta} - (1-\lambda)\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})\Vert = \Vert(\boldsymbol{\varphi} - \Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})) + \lambda(\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})-\boldsymbol{\theta})\Vert \end{equation} Squaring both sides and subtracting \Vert\boldsymbol{\varphi} - \Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})\Vert^2 from both sides leaves: \begin{equation} \lambda^2\Vert\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})-\boldsymbol{\theta}\Vert^2 + 2\lambda(\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})-\boldsymbol{\theta})\cdot(\boldsymbol{\varphi} - \Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})) \geq 0 \end{equation} Since \lambda \in (0, 1), we can divide by \lambda: \begin{equation} \lambda\Vert\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})-\boldsymbol{\theta}\Vert^2 + 2(\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})-\boldsymbol{\theta})\cdot(\boldsymbol{\varphi} - \Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})) \geq 0 \end{equation} This holds for all \lambda \in (0, 1), so letting \lambda \to 0^+, it must also hold that: \begin{equation} (\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})-\boldsymbol{\theta})\cdot(\boldsymbol{\varphi} - \Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})) \geq 0 \end{equation} Multiplying this by 2 and adding \Vert\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})-\boldsymbol{\theta}\Vert^2 + \Vert\boldsymbol{\varphi} - \Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})\Vert^2 to both sides, the left side becomes exactly \Vert\boldsymbol{\varphi}-\boldsymbol{\theta}\Vert^2, so:
\begin{equation} \Vert\boldsymbol{\varphi}-\boldsymbol{\theta}\Vert^2 \geq \Vert\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})-\boldsymbol{\theta}\Vert^2 + \Vert\boldsymbol{\varphi} - \Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})\Vert^2 \geq \Vert\Pi_{\boldsymbol{\Theta}} (\boldsymbol{\varphi})-\boldsymbol{\theta}\Vert^2 \end{equation} ◻
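The inequality just proved (the projection is non-expansive toward any point of \boldsymbol{\Theta}) can be spot-checked numerically for the ball case. The following sketch draws random \boldsymbol{\varphi} and \boldsymbol{\theta} and records the worst-case ratio, which should never exceed 1:

```python
import numpy as np

def project_ball(phi, radius=1.0):
    """Closed-form Pi_Theta for the ball Theta = {theta : ||theta|| <= radius}."""
    n = np.linalg.norm(phi)
    return phi if n <= radius else phi * (radius / n)

rng = np.random.default_rng(3)
worst = 0.0
for _ in range(1000):
    phi = 3.0 * rng.normal(size=4)                     # arbitrary point in R^d
    theta = rng.normal(size=4)
    theta = theta / max(1.0, np.linalg.norm(theta))    # force theta into Theta
    worst = max(worst, np.linalg.norm(project_ball(phi) - theta)
                / np.linalg.norm(phi - theta))
```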
Assumption Analysis
We have now completed the proof. The results above come from Martin Zinkevich’s 2003 paper “Online Convex Programming and Generalized Infinitesimal Gradient Ascent”. Note that the optimization literature is vast, and as a beginner I may have made errors or omissions in tracing the sources; readers are welcome to correct me.
Let’s summarize all the assumptions used in the complete proof:
1. \boldsymbol{\Theta} is a bounded convex set, D = \max_{\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \boldsymbol{\Theta}} \Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2\Vert < \infty;
2. For any \boldsymbol{\theta} \in \boldsymbol{\Theta} and any \boldsymbol{x}, L(\boldsymbol{x}, \boldsymbol{\theta}) is a convex function with respect to \boldsymbol{\theta};
3. For any \boldsymbol{\theta} \in \boldsymbol{\Theta} and any \boldsymbol{x}, \Vert\nabla_{\boldsymbol{\theta}}L(\boldsymbol{x}, \boldsymbol{\theta})\Vert \leq G < \infty;
4. The learning rate \eta_t is a monotonically decreasing function of t (i.e., \eta_t \geq \eta_{t+1}).
Under these assumptions, the projected SGD in equation [eq:sgd-p] satisfies inequality [neq:core].
Among these, assumptions 1 and 4 are quite reasonable: for practical purposes, a sufficiently large ball is essentially indistinguishable from \mathbb{R}^d, and a decreasing learning rate matches common practice. The strongest and least realistic assumption is the second one (convexity), yet almost all optimization theory is built on it; we can only hope that, after entering a certain region, the loss function behaves approximately convexly. The third assumption is also strong, but in practice, with good initialization and an appropriate learning rate, the gradient norm can usually be kept under control.
Summary
In this article, we revisited an old paper on convex optimization and introduced a fundamental convergence proof for SGD: under appropriate (actually very strong) assumptions, the convergence of SGD can be guaranteed. Although these assumptions may not always hold in practice, such as the convexity and gradient norm limits, these theoretical results still provide important insights into the convergence of SGD.
Original address: https://kexue.fm/archives/9902