Ever since I proposed the Tiger optimizer in “Tiger: An Extremely ‘Stingy’ Optimizer”, Tiger has become my “standard” optimizer for training models. Recently, I attempted to apply Tiger to the pre-training of a model with 7 billion parameters. The initial results looked promising, providing preliminary evidence that Tiger can indeed scale up. However, upon inspecting the trained model weights, I discovered some anomalies in the Embeddings: some components of the Embedding reached levels of \pm 100.
Through analysis, I found that similar phenomena do not occur with Adam. This is an issue specific to optimizers like Tiger or Lion that utilize the sign function \text{sign}. This article records my analysis process and provides two reference solutions at the end for your consideration.
Phenomena
The following analysis uses the Tiger optimizer as an example, but the process and conclusions are equally applicable to Lion.
First, the phenomena I observed are as follows:
1. Embedding components for some tokens have become \pm 100;
2. A small portion of other token Embedding components are trending towards \pm 100;
3. These tokens appear to be quite low-frequency tokens;
4. The maximum value of the entire Embedding matrix is exactly 100, and the minimum is -100;
5. Aside from the Embedding, other weights do not exhibit this problem;
6. The overall performance of the model (e.g., training loss, generation tests) is normal.
Some readers might ask: if the model performance is normal, why bother with it? In my view, there are at least two reasons. First, if one wishes to perform fine-tuning later, some low-frequency tokens might become high-frequency again; if their Embeddings are too poor, fine-tuning might not be able to recover them. Second, some capabilities are not reflected in the Loss. For instance, Chinese-English pre-trained models often exhibit certain multilingual capabilities because the training corpora are mixed with a very small amount of other languages. Clearly, this capability depends on the quality of low-frequency token Embeddings. If this capability is lost due to the optimizer, it would be a significant loss.
Of course, regardless of the optimizer, it is not surprising for a model to collapse during training, and it is often difficult to investigate deeply. However, what is most intriguing here is that the “collapse” is so regular—exactly \pm 100. This compelled me to further investigate the underlying cause.
Thinking
Based on the observations above, it can be preliminarily concluded that these anomalies only appear in the “Embeddings of low-frequency tokens.” This reminded me of the issue discussed in “Keras Implementation of Two Optimizers: Lookahead and LazyOptimizer”, where optimizers with momentum can lead to over-optimization of the Embedding layer.
Specifically, as soon as a token has appeared once, the momentum corresponding to that token’s Embedding becomes non-zero (assuming the gradient is not exactly zero). Consequently, in subsequent steps, even if the token does not appear in the current sample (its gradient is zero), its Embedding is still updated because the momentum is non-zero. This is the over-optimization problem for low-frequency tokens. It occurs in all optimizers with momentum, including both Adam and Tiger. In Adam, however, the effect is hardly noticeable, because the update amount is proportional to the momentum: if a token does not reappear for a long time, the momentum decays exponentially toward zero, so the update amount also quickly vanishes and the over-optimization stops.
However, the situation is different in Tiger. Tiger’s update amount is proportional to the sign of the momentum, \text{sign}(\boldsymbol{m}_t). Although the momentum \boldsymbol{m}_t decays exponentially, its sign does not: until \boldsymbol{m}_t underflows to zero due to floating-point rounding, \text{sign}(\boldsymbol{m}_t) stays at \pm 1, so the update amount remains constant. This makes the over-optimization of Embeddings far more severe in Tiger. Worse still, once a token’s Embedding has been biased in some direction by over-optimization, the gradient may adapt to and even encourage this change: the next time the token appears, the gradient may point in the same direction rather than the opposite one, leading to long-term over-optimization in a single direction and eventually to the anomalous values.
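This contrast can be checked with a minimal numerical sketch (the values of beta, the learning rate, and the step count are illustrative, not the author’s actual setup): we track the update size for a single parameter whose token appears in the first batch and then never again, under an Adam-like rule versus Tiger’s sign-based rule.

```python
import numpy as np

# Illustrative hyperparameters (not the author's actual training setup)
beta, lr, steps = 0.9, 1e-3, 50

m = 0.0
adam_updates, tiger_updates = [], []
for t in range(steps):
    g = 1.0 if t == 0 else 0.0             # gradient fires once, then stays zero
    m = beta * m + (1 - beta) * g          # momentum decays exponentially afterwards
    adam_updates.append(lr * m)            # Adam-like: proportional to momentum
    tiger_updates.append(lr * np.sign(m))  # Tiger: full step size while m != 0
```

After 50 steps, the Adam-like update has shrunk by orders of magnitude, while Tiger is still applying the full step size — exactly the over-optimization described above.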
Calculation
So why are the anomalous values exactly \pm 100? This is where weight decay comes into play. The general optimization formula for Tiger is: \begin{equation} \boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t \left[\text{sign}(\boldsymbol{m}_t) + \lambda \boldsymbol{\theta}_{t-1}\right] \end{equation} In other words, in addition to the sign function of the momentum, there is a weight decay term. In the anomalous experiment mentioned at the beginning, the decay rate \lambda was set to 0.01.
It is not difficult to see that if \text{sign}(\boldsymbol{m}_t) remains constant for a long time, the iteration above has an equilibrium point, reached when \text{sign}(\boldsymbol{m}_t) + \lambda \boldsymbol{\theta}^* = \boldsymbol{0}, i.e.: \begin{equation} \boldsymbol{\theta}^* = -\frac{\text{sign}(\boldsymbol{m}_t)}{\lambda} \end{equation} With \lambda = 0.01, this is exactly a vector with components \pm 100, which explains why the anomalous values are precisely \pm 100. Interested readers can also assume \eta_t is constant and solve for the analytical expression of \boldsymbol{\theta}_t to analyze the convergence speed and so on; I will not expand on that here.
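The equilibrium point can be verified numerically. Here is a sketch that freezes \text{sign}(\boldsymbol{m}_t) = +1 and iterates the update with \lambda = 0.01 (the value of \eta is arbitrary and only affects the convergence speed):

```python
# Freeze sign(m_t) = +1 and iterate theta_t = theta_{t-1} - eta*(sign + lam*theta).
# eta is an arbitrary illustrative value; lam matches the article's setting.
eta, lam = 1e-2, 0.01
theta = 0.0
for _ in range(200_000):
    theta -= eta * (1.0 + lam * theta)

# theta approaches the equilibrium -sign/lam = -100, matching the observed anomaly.
```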
Countermeasures
Since the problem arises from the over-optimization of low-frequency token Embeddings, a natural solution is to make the Embedding updates “Lazy,” as suggested in “Keras Implementation of Two Optimizers: Lookahead and LazyOptimizer”. That is, only update the corresponding Embedding when the token actually appears. If one can obtain the set of all input Token IDs, one can directly update only those Embeddings. If not, we can determine whether an Embedding needs to be updated by checking if its gradient norm is non-zero.
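A hedged sketch of such a lazy update step follows (function name and hyperparameters are illustrative, not the author’s code): momentum is still accumulated for every row, but the parameter update is masked by the per-row gradient norm, so stale momentum cannot keep dragging absent tokens toward \pm 1/\lambda.

```python
import numpy as np

def lazy_tiger_step(emb, m, grad, lr=1e-3, beta=0.965, lam=0.01):
    """One 'lazy' Tiger-style step for an embedding matrix (illustrative)."""
    m = beta * m + (1 - beta) * grad
    # A row is "active" only if its gradient norm is non-zero,
    # i.e. the corresponding token actually appeared in the batch.
    active = (np.linalg.norm(grad, axis=-1, keepdims=True) > 0).astype(emb.dtype)
    emb = emb - lr * active * (np.sign(m) + lam * emb)
    return emb, m
```

Note that only the update is masked; the momentum of inactive rows still decays via the \beta factor, mirroring the LazyOptimizer idea referenced above.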
On the other hand, from a more general perspective, this problem is a common flaw of Lion/Tiger optimizers for parameters with sparse gradients, including but not limited to the Embedding layer. Thus, another approach to solving the problem is to make the Embedding gradients non-sparse. For this, we can consider Tied Embeddings, where the input and output Embeddings are shared. Since the output layer reuses the entire Embedding matrix, the entire matrix will have non-zero gradients, preventing \boldsymbol{m}_t from remaining constant for a long time. Of course, Tied Embeddings might bring other issues; corresponding solutions can be found in “Re-exploring Shared Embeddings at the Output Layer of Language Models”. In my experiments, using Tied Embeddings in which the two halves of the model’s feature channels are swapped (as in the referenced article) solved the above problem, and the performance even seemed slightly better than with Untied Embeddings.
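The density claim is easy to demonstrate (shapes and names below are assumptions for the sketch, not the author’s model): with tied embeddings, the output logits reuse the whole embedding matrix E, so the softmax cross-entropy gradient with respect to E is dense even for a single token position.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 8, 4
E = rng.normal(size=(vocab, dim))   # shared input/output embedding matrix
h = rng.normal(size=(1, dim))       # final hidden state for one position

logits = h @ E.T                    # output layer reuses all of E
p = np.exp(logits - logits.max())
p /= p.sum()
d_logits = p.copy()
d_logits[0, 3] -= 1.0               # softmax cross-entropy grad, target id 3
dE = d_logits.T @ h                 # gradient w.r.t. the tied matrix E
```

Every row of dE is non-zero (softmax probabilities are strictly positive), so \text{sign}(\boldsymbol{m}_t) for the embedding no longer freezes on tokens absent from the batch.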
Finally, I consulted the author of the Lion optimizer regarding this issue. The response was that they had also noticed this problem previously. Their solution is to use a hybrid optimizer—for example, using Adam for the Embedding layer and Lion/Tiger only for other layers. Well, this was a solution I hadn’t considered; it doesn’t feel particularly elegant, but it does indeed solve the problem. Readers may choose for themselves.
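In practice, the hybrid approach amounts to routing parameters to different update rules by name. A minimal sketch of the dispatch (the naming convention is an illustrative assumption; real frameworks express this via per-parameter optimizer groups):

```python
def pick_optimizer(param_name: str) -> str:
    """Route embedding parameters to Adam, everything else to Lion/Tiger.
    The substring check is an illustrative convention, not a fixed API."""
    return "adam" if "embed" in param_name.lower() else "tiger"
```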
Summary
This article introduced the phenomenon of Embedding anomalies under Lion/Tiger optimizer training, analyzed the underlying causes, and finally provided reference solutions.
When reposting, please include the original address of this article: https://kexue.fm/archives/9736. For more detailed reposting matters, please refer to “Scientific Space FAQ”.