In two previous articles, we discussed the convergence properties of SGD. However, those results only addressed convergence of the loss value: they guarantee that we can approach the optimal loss, but not that we can locate the optimizer \boldsymbol{\theta}^*. This is a significant gap between the current theoretical conclusions and practice. Intuitively, the weights \boldsymbol{\theta}_T at the end of training should be close to the theoretical optimum \boldsymbol{\theta}^*, and we would like to know whether theory supports this.
Therefore, in this article, we will transform the convergence results of the average loss into convergence results for the last-iterate loss, providing a preliminary theoretical understanding of how far \boldsymbol{\theta}_T is from \boldsymbol{\theta}^*.
Finding the Position
We start from the article "Making Alchemy More Scientific (II): Generalizing Conclusions to Unbounded Domains". Its core result is the inequality: \begin{equation} \sum_{t=1}^T \eta_t \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\varphi})]\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\varphi}\Vert^2}{2} + \frac{G^2}{2}\sum_{t=1}^T \eta_t^2 \label{leq:avg-2-mid3} \end{equation} Substituting \boldsymbol{\varphi}=\boldsymbol{\theta}^* makes every summand on the left nonnegative, so, assuming \eta_t is monotonically decreasing, we may replace each \eta_t on the left with \eta_T. Dividing both sides by T\eta_T then recovers one of the conclusions of the previous post: \begin{equation} \frac{1}{T}\sum_{t=1}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)] \leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2T\eta_T} + \frac{G^2}{2T}\sum_{t=1}^T \frac{\eta_t^2}{\eta_T} \label{leq:avg-2} \end{equation} As noted at the beginning, this is only a convergence result for the loss value, whereas we are more interested in convergence of the position. A relatively simple idea is to use the convexity of L together with Jensen's inequality to obtain: \begin{equation} \frac{1}{T}\sum_{t=1}^T \mathbb{E}[L(\boldsymbol{\theta}_t)] = \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T L(\boldsymbol{\theta}_t)\right] \geq \mathbb{E}\left[L\left(\frac{1}{T}\sum_{t=1}^T \boldsymbol{\theta}_t\right)\right] \end{equation} Defining \bar{\boldsymbol{\theta}}_T = \frac{1}{T}\sum_{t=1}^T \boldsymbol{\theta}_t, we then have: \begin{equation} \mathbb{E}[L(\bar{\boldsymbol{\theta}}_T) - L(\boldsymbol{\theta}^*)] \leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2T\eta_T} + \frac{G^2}{2T}\sum_{t=1}^T \frac{\eta_t^2}{\eta_T} \end{equation} This means that the loss value L(\bar{\boldsymbol{\theta}}_T) at the centroid \bar{\boldsymbol{\theta}}_T of the training trajectory \boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \cdots, \boldsymbol{\theta}_T converges to L(\boldsymbol{\theta}^*) on average.
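As a sanity check, the averaged bound \eqref{leq:avg-2} and the Jensen step can be verified numerically. The sketch below uses plain (full-batch) gradient descent on a toy 2-D quadratic, so every expectation reduces to an ordinary value; the gradient bound G is taken empirically from the trajectory, and all concrete numbers are arbitrary choices for illustration:

```python
import numpy as np

# Toy convex objective: a 2-D quadratic (all numbers are arbitrary).
A = np.array([[3.0, 1.0], [1.0, 2.0]])          # positive definite => L convex
theta_star = np.array([1.0, -2.0])
L = lambda th: 0.5 * (th - theta_star) @ A @ (th - theta_star)

T = 1000
etas = 0.1 / np.sqrt(np.arange(1, T + 1))       # decreasing eta_t = alpha/sqrt(t)
theta = np.array([5.0, 5.0])                    # theta_1
theta1 = theta.copy()
thetas, losses, grad_norms = [], [], []
for eta in etas:
    g = A @ (theta - theta_star)                # exact gradient stands in for SGD
    thetas.append(theta.copy())
    losses.append(L(theta))
    grad_norms.append(np.linalg.norm(g))
    theta = theta - eta * g

G = max(grad_norms)                             # empirical gradient bound on this run
avg_gap = np.mean(losses) - L(theta_star)       # left side of the averaged bound
bound = (np.dot(theta1 - theta_star, theta1 - theta_star) / (2 * T * etas[-1])
         + G**2 / (2 * T) * np.sum(etas**2) / etas[-1])
theta_bar = np.mean(thetas, axis=0)             # trajectory centroid

print(avg_gap <= bound)                         # averaged bound holds on this run
print(L(theta_bar) <= np.mean(losses))          # Jensen's inequality
```

On this deterministic run both checks pass; with genuinely stochastic gradients one would average over many runs to approximate the expectations.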
If L is strictly convex, its minimizer is unique, so this implies that \bar{\boldsymbol{\theta}}_T converges to \boldsymbol{\theta}^* on average. This explains, to some extent, the practice of taking a moving average of the training trajectory to obtain better weights, including the rationality of the Merge operation in WSM (Warmup-Stable and Merge), etc.
Preparations
Calculating \bar{\boldsymbol{\theta}}_T provides a way to find \boldsymbol{\theta}^*, but it does not fully answer the question posed at the beginning of this article: we are more interested in whether \boldsymbol{\theta}_T itself converges to \boldsymbol{\theta}^*. Next, following the logic of "Last Iterate of SGD Converges (Even in Unbounded Domains)", we transform the average-loss convergence into last-iterate loss convergence.
Before the formal proof, some preparation is needed. One task is to generalize \eqref{leq:avg-2-mid3}. From its proof we know that the lower limit of the summation can in principle be arbitrary: the starting point \boldsymbol{\theta}_1 can be replaced by any \boldsymbol{\theta}_{T-k}, and the inequality still holds. However, \boldsymbol{\theta}_{T-k} is itself random (it depends on the samples drawn before step T-k), so we must also take \mathbb{E} on the right-hand side, obtaining: \begin{equation} \sum_{t=T-k}^T \eta_t \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\varphi})]\leq \frac{\mathbb{E}[\Vert\boldsymbol{\theta}_{T-k} - \boldsymbol{\varphi}\Vert^2]}{2} + \frac{G^2}{2}\sum_{t=T-k}^T \eta_t^2 \label{leq:last-mid1} \end{equation} Again assuming that \eta_t is monotonically decreasing, we replace \eta_t on the left with \eta_T and divide both sides by \eta_T: \begin{equation} \sum_{t=T-k}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\varphi})] \leq \frac{\mathbb{E}[\Vert\boldsymbol{\theta}_{T-k} - \boldsymbol{\varphi}\Vert^2]}{2\eta_T} + \frac{G^2}{2} \sum_{t=T-k}^T\frac{\eta_t^2}{\eta_T} \end{equation} Here \boldsymbol{\varphi} is any data-independent vector. However, this "data-independence" is relative: reviewing the proof, when the starting point is chosen as T-k, \boldsymbol{\varphi} may depend on at most \boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_{T-k-1}. In particular, \boldsymbol{\theta}_{T-k} satisfies this condition, and substituting \boldsymbol{\varphi}=\boldsymbol{\theta}_{T-k} makes the first term on the right vanish: \begin{equation} \sum_{t=T-k}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}_{T-k})] \leq \frac{G^2}{2} \sum_{t=T-k}^T \frac{\eta_t^2}{\eta_T} \label{leq:last-mid2} \end{equation} This is an important intermediate conclusion for later use.
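The generalized inequality \eqref{leq:last-mid1} can also be checked numerically. The sketch below reuses the deterministic gradient-descent setup (so all expectations are plain values) and tests the inequality pathwise for several starting points T-k and several fixed vectors \boldsymbol{\varphi}; the toy quadratic and the empirically measured G are assumptions of this check, not part of the original derivation:

```python
import numpy as np

# Same toy quadratic as before; gradient descent stands in for SGD.
A = np.array([[3.0, 1.0], [1.0, 2.0]])        # positive definite => L convex
theta_star = np.array([1.0, -2.0])
L = lambda th: 0.5 * (th - theta_star) @ A @ (th - theta_star)

T = 200
etas = 0.1 / np.sqrt(np.arange(1, T + 1))     # decreasing eta_t
theta = np.array([5.0, 5.0])
thetas, losses, grad_norms = [], [], []
for eta in etas:
    g = A @ (theta - theta_star)
    thetas.append(theta.copy())
    losses.append(L(theta))
    grad_norms.append(np.linalg.norm(g))
    theta = theta - eta * g
G = max(grad_norms)                           # empirical gradient bound

rng = np.random.default_rng(0)
ok = True
for k in (0, 10, 100):                        # tail sums t = T-k .. T (1-indexed)
    for phi in rng.normal(size=(5, 2)):       # arbitrary fixed comparison points
        lhs = sum(etas[t - 1] * (losses[t - 1] - L(phi))
                  for t in range(T - k, T + 1))
        rhs = (np.dot(thetas[T - k - 1] - phi, thetas[T - k - 1] - phi) / 2
               + G**2 / 2 * sum(etas[t - 1] ** 2 for t in range(T - k, T + 1)))
        ok = ok and (lhs <= rhs)
print(ok)
```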
Key Identity
To transform the average loss conclusion into the last-iterate form, we also need to prepare a very crucial identity: \begin{equation} q_T = \frac{1}{T}\sum_{t=1}^T q_t + \sum_{k=1}^{T-1} \frac{1}{k(k+1)}\sum_{t=T-k}^T (q_t - q_{T-k}) \label{eq:qt} \end{equation}
This identity cleverly establishes the connection between the last-iterate value and the average value. The author spent several days trying to find an intuitive understanding of it but failed, so we can only introduce its proof step by step. The idea is to consider the cumulative average of q_t from the end to the beginning. Define S_k = \frac{1}{k}\sum_{t=T-k+1}^T q_t; then we can write: \begin{equation} \begin{aligned} k S_k =&\, (k + 1) S_{k+1} - q_{T-k} \\[5pt] =&\, k S_{k+1} + (S_{k+1} - q_{T-k}) \\ =&\, k S_{k+1} + \frac{1}{k+1}\sum_{t=T-k}^T (q_t - q_{T-k}) \end{aligned} \end{equation} Dividing both sides by k and then summing over k=1,\cdots,T-1, the S terms telescope, giving: \begin{equation} S_1 = S_T + \sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=T-k}^T (q_t - q_{T-k}) \end{equation} Finally, substituting the original definitions of S_1 and S_T yields \eqref{eq:qt}. The core of the entire derivation is using the "cumulative average" operation as a natural transition from the last-iterate value q_T to the average value \frac{1}{T}\sum_{t=1}^T q_t. In the original blog, this formula appeared in a slightly different inequality form, but the author believes the identity is more fundamental, and the subsequent proof only requires the identity form.
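Since the identity \eqref{eq:qt} holds for any sequence, it is easy to verify numerically; a minimal sketch with a random sequence:

```python
import numpy as np

# Check the key identity: q_T equals the average of q_1..q_T plus the
# double-sum correction series, for an arbitrary sequence q_t.
rng = np.random.default_rng(42)
T = 50
q = rng.normal(size=T)                    # q[t-1] holds q_t, t = 1..T

avg = q.mean()
correction = sum(
    1.0 / (k * (k + 1)) * sum(q[t - 1] - q[T - k - 1] for t in range(T - k, T + 1))
    for k in range(1, T)
)
print(abs(q[-1] - (avg + correction)) < 1e-9)   # agrees up to rounding
```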
Completing the Proof
Now we can complete the proof in one go. Define q_t = \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)]. Substituting this into identity \eqref{eq:qt} gives: \begin{equation} \mathbb{E}[L(\boldsymbol{\theta}_T) - L(\boldsymbol{\theta}^*)] = \underbrace{\frac{1}{T}\sum_{t=1}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}^*)]}_{\eqref{leq:avg-2}} + \sum_{k=1}^{T-1} \frac{1}{k(k+1)}\underbrace{\sum_{t=T-k}^T \mathbb{E}[L(\boldsymbol{\theta}_t) - L(\boldsymbol{\theta}_{T-k})]}_{\eqref{leq:last-mid2}} \end{equation} Bounding the two braced sums by inequalities \eqref{leq:avg-2} and \eqref{leq:last-mid2} respectively, we get: \begin{equation} \mathbb{E}[L(\boldsymbol{\theta}_T) - L(\boldsymbol{\theta}^*)] \leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2T\eta_T} + \frac{G^2}{2T}\sum_{t=1}^T \frac{\eta_t^2}{\eta_T} + \frac{G^2}{2}\sum_{k=1}^{T-1} \frac{1}{k(k+1)}\sum_{t=T-k}^T \frac{\eta_t^2}{\eta_T} \label{leq:last-mid3} \end{equation} For the last term, swapping the order of summation gives: \begin{equation} \begin{aligned} \sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=T-k}^{T}\frac{\eta_t^2}{\eta_T} &= \sum_{t=1}^{T}\frac{\eta_t^2}{\eta_T}\sum_{k=\max(1,\,T-t)}^{T-1} \frac{1}{k(k+1)} \\ &= \sum_{t=1}^{T}\frac{\eta_t^2}{\eta_T}\left(\frac{1}{\max(1,\,T-t)}-\frac{1}{T}\right) \end{aligned} \label{eq:last-mid4} \end{equation} Substituting this back into \eqref{leq:last-mid3}, the -\frac{1}{T} piece exactly cancels the term \frac{G^2}{2T}\sum_{t=1}^T \frac{\eta_t^2}{\eta_T}, and separating out the t=T summand (for which \max(1,\,T-t)=1 and \eta_T^2/\eta_T=\eta_T) yields: \begin{equation} \begin{aligned} \mathbb{E}[L(\boldsymbol{\theta}_T) - L(\boldsymbol{\theta}^*)] &\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2T\eta_T} + \frac{G^2}{2}\sum_{t=1}^{T}\frac{\eta_t^2/\eta_T}{\max(1,\,T-t)} \\ &= \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2T\eta_T} + \frac{G^2\eta_T}{2} + \frac{G^2}{2}\sum_{t=1}^{T-1}\frac{\eta_t^2/\eta_T}{T-t} \end{aligned} \label{leq:last-1} \end{equation} This is our final conclusion.
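The summation swap \eqref{eq:last-mid4} is purely combinatorial, so it too can be checked numerically for an arbitrary positive sequence a_t standing in for \eta_t^2/\eta_T:

```python
import numpy as np

# Check the summation swap: reweighting the tail sums by 1/(k(k+1)) equals
# weighting each a_t by (1/max(1, T-t) - 1/T).
rng = np.random.default_rng(0)
T = 40
a = rng.uniform(0.1, 1.0, size=T)         # a[t-1] holds a_t, t = 1..T

lhs = sum(1.0 / (k * (k + 1)) * sum(a[t - 1] for t in range(T - k, T + 1))
          for k in range(1, T))
rhs = sum(a[t - 1] * (1.0 / max(1, T - t) - 1.0 / T) for t in range(1, T + 1))
print(abs(lhs - rhs) < 1e-12)             # agrees up to rounding
```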
Because \eqref{eq:last-mid4} swaps the order of summation and simplifies in advance, this result is more concise and general than the one in the original blog "Last Iterate of SGD Converges (Even in Unbounded Domains)".
Two Examples
It is not difficult to see that the right-hand side of conclusion \eqref{leq:last-1} has a similar form to \eqref{leq:avg-2}, which suggests that the last-iterate loss should converge at a rate similar to the average loss. As before, we examine the final conclusion through two examples: a static learning rate and a dynamic learning rate. First, for a static learning rate \eta_t = \eta, using \sum_{t=1}^{T-1}\frac{1}{T-t} = \sum_{s=1}^{T-1}\frac{1}{s} \leq 1 + \ln T: \begin{equation} \begin{aligned} \mathbb{E}[L(\boldsymbol{\theta}_T) - L(\boldsymbol{\theta}^*)] &\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2T\eta} + \frac{G^2\eta}{2} + \frac{G^2\eta}{2}\sum_{t=1}^{T-1}\frac{1}{T-t} \\ &\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2T\eta} + \frac{G^2\eta}{2} (2 + \ln T) \end{aligned} \end{equation} Taking \eta = \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert/G}{\sqrt{T(2+\ln T)}} minimizes the right-hand side, giving a convergence rate of \mathcal{O}(\sqrt{\ln T/T}). This is slightly slower than the rate for the average loss: in the previous post, we proved that under a constant learning rate the average loss converges at \mathcal{O}(1/\sqrt{T}). Of course, this is a worst-case difference; in practice, the extra factor of \sqrt{\ln T} may well be negligible.
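A quick check of the optimal step size: with D = \Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert, the right-hand side is B(\eta) = \frac{D^2}{2T\eta} + \frac{G^2\eta}{2}(2+\ln T), and the stated \eta attains the value DG\sqrt{(2+\ln T)/T}. The sketch below confirms this numerically (D, G, T are arbitrary placeholder values):

```python
import numpy as np

# Constant-step bound B(eta) and the claimed minimizer eta_opt.
D, G, T = 2.0, 5.0, 10_000                    # arbitrary illustrative values
B = lambda eta: D**2 / (2 * T * eta) + G**2 * eta * (2 + np.log(T)) / 2

eta_opt = (D / G) / np.sqrt(T * (2 + np.log(T)))
grid = eta_opt * np.linspace(0.2, 5.0, 401)   # candidate step sizes around eta_opt

print(np.isclose(B(eta_opt), D * G * np.sqrt((2 + np.log(T)) / T)))
print(B(eta_opt) <= B(grid).min() + 1e-12)    # no grid point does better
```

The two terms of B(\eta) are equal at \eta = \eta_{\text{opt}}, which is the usual AM-GM balancing argument.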
Next, consider the dynamic learning rate \eta_t = \frac{\alpha}{\sqrt{t}}. Substituting into \eqref{leq:last-1} and using the partial-fraction decomposition \frac{1}{t(T-t)} = \frac{1}{T}\left(\frac{1}{t} + \frac{1}{T-t}\right) gives: \begin{equation} \begin{aligned} \mathbb{E}[L(\boldsymbol{\theta}_T) - L(\boldsymbol{\theta}^*)] &\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2\alpha\sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}} + \frac{G^2\alpha\sqrt{T}}{2}\sum_{t=1}^{T-1}\frac{1}{t(T-t)} \\ &= \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2\alpha\sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}}\sum_{t=1}^{T-1}\left(\frac{1}{t} + \frac{1}{T-t}\right) \\ &= \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2\alpha\sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}} + \frac{G^2\alpha}{\sqrt{T}}\sum_{t=1}^{T-1}\frac{1}{t} \\ &\leq \frac{\Vert\boldsymbol{\theta}_1 - \boldsymbol{\theta}^*\Vert^2}{2\alpha\sqrt{T}} + \frac{G^2\alpha}{2\sqrt{T}} + \frac{G^2\alpha}{\sqrt{T}}(1 + \ln T) \\ &\sim \mathcal{O}\left(\frac{\ln T}{\sqrt{T}}\right) \end{aligned} \end{equation} As with the average-loss convergence in unbounded domains from the previous article, under the dynamic learning rate \eta_t = \frac{\alpha}{\sqrt{t}} the convergence rate is \mathcal{O}(\ln T / \sqrt{T}), although the constant here is larger.
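The partial-fraction step can be verified directly: \sum_{t=1}^{T-1}\frac{1}{t(T-t)} = \frac{2}{T}H_{T-1}, where H_{T-1} is the harmonic number, together with the bound H_{T-1} \leq 1 + \ln T used above. A minimal sketch:

```python
import math

# Check: sum_{t=1}^{T-1} 1/(t(T-t)) = (2/T) * H_{T-1}, and H_{T-1} <= 1 + ln T.
T = 1000
s = sum(1.0 / (t * (T - t)) for t in range(1, T))
H = sum(1.0 / t for t in range(1, T))            # harmonic number H_{T-1}

print(abs(s - 2 * H / T) < 1e-12)                # the identity holds
print(H <= 1 + math.log(T))                      # the logarithmic bound holds
```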
Summary
In this article, we generalized the convergence conclusions of SGD from average loss to last-iterate loss, considering how close the loss value at the end of training is to the theoretical optimum. This setting is more aligned with our training practices.
Reprinting notice: Please include the original address of this article when reprinting: https://kexue.fm/archives/11480