English (unofficial) translations of posts at kexue.fm

Why is Adam's Update RMS 0.2?

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate; please refer to the original post for anything important.

As is well known, we began experimenting with using Muon for large-scale LLM training quite early on. Specifically, in the post "Muon Sequel: Why We Chose to Try Muon?", we proposed the "Match Adam Update RMS" trick to facilitate a quick migration from Adam to Muon. This technique was also applied in the training of Kimi K2. The trick involves unifying Muon’s Update RMS to 0.2, which allows us to reuse Adam’s learning rate and weight decay rate.

Behind this trick is our observation that Adam’s Update RMS is approximately equal to 0.2, and this phenomenon is stable and reproducible. This raises an interesting question: Why is Adam’s Update RMS 0.2? Can we explain it theoretically?

Introduction to the Problem

First, let’s describe the phenomenon: from experiments, we observe that roughly after the warmup ends and the model enters formal training, Adam’s Update RMS almost always stays between 0.2 and 0.3. This pattern is consistent across models of different sizes. The commonality among these models is that they are all trained with Adam using parameters \beta_1=0.9, \beta_2=0.95. Since this commonality is so obvious, it is likely not a coincidence. Therefore, I attempted to analyze the underlying principle.

Let’s review the form of the Adam optimizer: \begin{equation} \text{Adam}\textcolor{cyan}{\text{W}} := \left\{ \begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + (1 - \beta_1) \boldsymbol{g}_t \\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + (1 - \beta_2) \boldsymbol{g}_t^2 \\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t / (1 - \beta_1^t) \\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t / (1 - \beta_2^t) \\ &\boldsymbol{u}_t = \hat{\boldsymbol{m}}_t / (\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon) \\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t (\boldsymbol{u}_t \textcolor{cyan}{+ \lambda_t \boldsymbol{\theta}_{t-1}}) \end{aligned} \right. \end{equation} Note: In this article, all vector multiplications and divisions, including squares, refer to the Hadamard product/quotient (element-wise operations).
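As a point of reference, here is a minimal NumPy sketch of a single AdamW step following the equations above; the function name adamw_step and its signature are illustrative, not any particular library's API:

import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, wd=0.0):
    """One AdamW update following the equations above (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * g            # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2         # second-moment EMA
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    u = m_hat / (np.sqrt(v_hat) + eps)         # the "update" whose RMS we study
    theta = theta - lr * (u + wd * theta)      # decoupled weight decay
    return theta, m, v, u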

Our goal is to prove that \|\boldsymbol{u}_t\|_{RMS} \approx 0.2, at least under the setting \beta_1=0.9, \beta_2=0.95. We assume \epsilon is small enough to be ignored, and we consider the steady state as t \to \infty. In this state, \beta_1^t and \beta_2^t are close to zero, so we do not need to distinguish between \boldsymbol{m}_t and \hat{\boldsymbol{m}}_t, or \boldsymbol{v}_t and \hat{\boldsymbol{v}}_t. Thus, we have \boldsymbol{u}_t = \boldsymbol{m}_t / \sqrt{\boldsymbol{v}_t}.

For \boldsymbol{m}_t and \boldsymbol{v}_t, we can obtain the expansion: \begin{equation} \boldsymbol{m}_t = (1 - \beta_1)\sum_{i=1}^t \beta_1^{t-i}\boldsymbol{g}_i, \qquad \boldsymbol{v}_t = (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}\boldsymbol{g}_i^2 \end{equation}
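For completeness, the first expansion follows by unrolling the recursion with \boldsymbol{m}_0 = \boldsymbol{0} (the expansion of \boldsymbol{v}_t is obtained in the same way, with \beta_2 and \boldsymbol{g}_i^2): \begin{equation} \boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + (1 - \beta_1) \boldsymbol{g}_t = \beta_1^2 \boldsymbol{m}_{t-2} + (1 - \beta_1)(\beta_1 \boldsymbol{g}_{t-1} + \boldsymbol{g}_t) = \cdots = (1 - \beta_1)\sum_{i=1}^t \beta_1^{t-i} \boldsymbol{g}_i \end{equation}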

Numerical Simulation

If we assume that \boldsymbol{g}_1, \boldsymbol{g}_2, \dots, \boldsymbol{g}_t are sampled from the same distribution, we can directly use numerical simulation to estimate \|\boldsymbol{u}_t\|_{RMS}. Let’s try the simplest case using the standard normal distribution \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}). The reference code is as follows:

import numpy as np

# Simulate Adam's steady-state Update RMS with pure-noise gradients g ~ N(0, I)
N, T = 10000, 2000              # number of components, number of steps
beta1, beta2 = 0.9, 0.95        # Adam's momentum coefficients
m, v = 0, 0
for t in range(1, T + 1):
    g = np.random.randn(N)                 # a fresh "gradient" sample at each step
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2     # second-moment EMA
    u = m / v**0.5                         # the update (bias correction negligible at large t)

rms = (u**2).mean()**0.5        # root-mean-square of the final update
print(rms)

Can you guess the result? The answer is approximately 0.225, which is strikingly similar to the experimental results! This suggests that our simulation assumptions align well with reality. Some readers might think this is incorrect—isn’t \boldsymbol{g} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) just pure noise? How can it match? Actual training is certainly not pure noise, but we can say that the signal-to-noise ratio (SNR) of a single gradient is extremely small, so it can be modeled as pure noise.

Readers can experiment with the code above to observe which variables affect the Update RMS. The general conclusions are: the Update RMS decreases as \beta_1 increases, seems largely independent of \beta_2, and increases if the distribution of \boldsymbol{g} has a non-zero mean (which is equivalent to increasing the SNR of the gradient).
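For instance, the following minimal variation of the script above (the helper name update_rms and its mu knob are purely illustrative) makes these three observations easy to reproduce:

import numpy as np

def update_rms(beta1=0.9, beta2=0.95, mu=0.0, N=10000, T=2000, seed=0):
    """Steady-state Update RMS when each gradient component is drawn from N(mu, 1)."""
    rng = np.random.default_rng(seed)
    m, v = 0.0, 0.0
    for _ in range(T):
        g = mu + rng.standard_normal(N)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    return ((m / v**0.5)**2).mean()**0.5

print(update_rms())              # baseline, roughly 0.23
print(update_rms(beta1=0.8))     # smaller beta1 -> larger Update RMS
print(update_rms(beta2=0.99))    # changing beta2 has little effect
print(update_rms(mu=0.5))        # non-zero mean (higher SNR) -> larger Update RMS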

Mean-Field Approximation

In this section, I will attempt to derive an approximate analytical solution for the simulation results. First, from the definition of RMS, to find \|\boldsymbol{u}_t\|_{RMS} we need \boldsymbol{u}_t^2 = \boldsymbol{m}_t^2 / \boldsymbol{v}_t. My idea is to use the expectation of \boldsymbol{u}_t^2 as an approximation and further simplify it with a mean-field approximation: \begin{equation} \mathbb{E}[\boldsymbol{u}_t^2] = \mathbb{E}\left[\frac{\boldsymbol{m}_t^2}{\boldsymbol{v}_t}\right] \approx \frac{\mathbb{E}[\boldsymbol{m}_t^2]}{\mathbb{E}[\boldsymbol{v}_t]} \end{equation} Some readers might question the validity of the last approximation step. My suggestion is to ignore these details for now, much like assuming \boldsymbol{g} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) in the previous section, and calculate it first. If the result is reasonable, then the process must be reasonable to some extent.

Now we calculate the numerator and denominator separately. In general, we write \mathbb{E}[\boldsymbol{g}] = \boldsymbol{\mu} and \mathbb{E}[\boldsymbol{g}^2] = \boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2. The denominator is simpler: \begin{equation} \begin{aligned} \mathbb{E}[\boldsymbol{v}_t] &= (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}\mathbb{E}[\boldsymbol{g}_i^2] \\ &= (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}(\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2) \\ &= (1 - \beta_2^t) (\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2) \\[5pt] &\approx \boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2 \qquad (t \to \infty) \end{aligned} \end{equation}

As for the numerator, we can expand the square directly or take a shortcut: what we need is the second moment of \boldsymbol{m}_t, namely \mathbb{E}[\boldsymbol{m}_t^2] = \mathbb{E}[\boldsymbol{m}_t]^2 + \text{Var}[\boldsymbol{m}_t]. The calculation of \mathbb{E}[\boldsymbol{m}_t] is analogous to that of \mathbb{E}[\boldsymbol{v}_t] and gives (1 - \beta_1^t)\boldsymbol{\mu} \approx \boldsymbol{\mu}. As for the variance, since the \boldsymbol{g}_i are independent, their variances add (with the coefficients squared), so: \begin{equation} \text{Var}[\boldsymbol{m}_t] = (1 - \beta_1)^2\sum_{i=1}^t \beta_1^{2(t-i)}\boldsymbol{\sigma}^2 = \frac{(1 - \beta_1)^2 (1 - \beta_1^{2t})}{1 - \beta_1^2}\boldsymbol{\sigma}^2 \approx \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2 \qquad (t \to \infty) \end{equation} Therefore: \begin{equation} \mathbb{E}[\boldsymbol{u}_t^2] \approx \frac{\boldsymbol{\mu}^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2}{\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2} \end{equation}
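As a quick numerical sanity check on these two moment estimates (the values \mu=0.3 and \sigma=1 below are arbitrary, chosen purely for illustration), we can compare them with the component-wise averages from the same kind of simulation as before:

import numpy as np

# Arbitrary illustrative values for the per-component gradient mean and std
beta1, beta2, mu, sigma = 0.9, 0.95, 0.3, 1.0
N, T = 100000, 2000
rng = np.random.default_rng(0)
m, v = 0.0, 0.0
for _ in range(T):
    g = mu + sigma * rng.standard_normal(N)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2

# Empirical component averages vs the mean-field predictions
print((m**2).mean(), mu**2 + (1 - beta1) / (1 + beta1) * sigma**2)  # E[m_t^2]
print(v.mean(), mu**2 + sigma**2)                                   # E[v_t]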

Analysis of Results

Since \mathbb{E}[\boldsymbol{u}_t^2] is already a squared vector, to estimate \|\boldsymbol{u}_t\|_{RMS} we only need to average its components and take the square root. For the averaging step, let's apply the mean-field approximation again (averaging the numerator and denominator separately), which yields: \begin{equation} \|\boldsymbol{u}_t\|_{RMS} \approx \sqrt{\frac{\|\boldsymbol{\mu}\|^2 + \frac{1 - \beta_1}{1 + \beta_1}\|\boldsymbol{\sigma}\|^2}{\|\boldsymbol{\mu}\|^2 + \|\boldsymbol{\sigma}\|^2}} = \sqrt{\frac{\|\boldsymbol{\mu}\|^2/\|\boldsymbol{\sigma}\|^2 + \frac{1 - \beta_1}{1 + \beta_1}}{\|\boldsymbol{\mu}\|^2/\|\boldsymbol{\sigma}\|^2 + 1}} \label{eq:mean-field} \end{equation} There are two influencing factors: first, \|\boldsymbol{\mu}\|^2/\|\boldsymbol{\sigma}\|^2, which can be viewed as the Signal-to-Noise Ratio (SNR) of the gradient; second, \beta_1, one of Adam's hyperparameters. Notably, the result does not depend on \beta_2, which matches our previous simulation.

How good is this approximation? Let's consider the simplest case, \boldsymbol{\mu}=\boldsymbol{0}: \begin{equation} \|\boldsymbol{u}_t\|_{RMS} \approx \sqrt{\frac{1 - \beta_1}{1 + \beta_1}} \end{equation} Substituting \beta_1=0.9 gives 0.2294\dots, which matches both the simulation results and practical observations very well! Furthermore, here are several comparisons with simulation results:

[Figure (SVG in the original post): Simulation results vs Mean-field approximation (different \beta_1, \beta_2)]

Overall, the degree of approximation is quite good, especially for \beta_2 \geq 0.9, where the results almost overlap with the mean-field approximation (as pointed out by @EIFY, the paper "Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks" also obtained the same calculation results).
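A rough version of this comparison can also be reproduced directly in code; the sketch below fixes \beta_2 = 0.95 and sweeps an arbitrary grid of \beta_1 values for \boldsymbol{\mu} = \boldsymbol{0}:

import numpy as np

# Compare the simulated Update RMS with the mean-field prediction sqrt((1-b1)/(1+b1))
N, T, beta2 = 10000, 2000, 0.95
rng = np.random.default_rng(0)
for beta1 in (0.8, 0.9, 0.95, 0.99):
    m, v = 0.0, 0.0
    for _ in range(T):
        g = rng.standard_normal(N)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    sim = ((m / v**0.5)**2).mean()**0.5
    pred = ((1 - beta1) / (1 + beta1))**0.5
    print(f"beta1={beta1}: simulated {sim:.4f}, mean-field {pred:.4f}")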

The comparison considering SNR is as follows:

[Figure (SVG in the original post): Simulation results vs Mean-field approximation (different \beta_1, SNR)]

As the SNR increases, the error of the mean-field approximation begins to grow, but it still predicts the overall trend. In fact, in actual training, the gradient SNR rarely reaches as high as 1, so the mean-field approximation remains a good estimate.

Inverse Prediction

If we accept the mean-field approximation [eq:mean-field], we can use it in reverse to estimate the gradient SNR: \begin{equation} \frac{\|\boldsymbol{\mu}\|^2}{\|\boldsymbol{\sigma}\|^2} \approx \frac{\|\boldsymbol{u}_t\|_{RMS}^2 - \frac{1 - \beta_1}{1 + \beta_1}}{1 - \|\boldsymbol{u}_t\|_{RMS}^2} \end{equation} In actual training, \beta_1 is given and \|\boldsymbol{u}_t\|_{RMS} (Adam's Update RMS) can be measured directly, so the formula above is computable. Of course, it only applies to Adam. Is there a more general estimation approach? Yes! Recall our earlier estimate: \begin{equation} \mathbb{E}[\boldsymbol{m}_t^2] \approx \boldsymbol{\mu}^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2 \end{equation} Summing its components and taking the square root, we treat the result as an approximation of \|\boldsymbol{m}_t\|: \begin{equation} \|\boldsymbol{m}_t\| \approx \sqrt{\|\boldsymbol{\mu}\|^2 + \frac{1 - \beta_1}{1 + \beta_1}\|\boldsymbol{\sigma}\|^2} \end{equation}

The second moment is \mathbb{E}[\boldsymbol{v}_t] \approx \boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2. Optimizers like Muon do not maintain a second moment, but we notice that this result is independent of \beta_2, so we can consider the simplest case, \beta_2=0, where \boldsymbol{v}_t = \boldsymbol{g}_t^2. This might be a bit forced, but for estimation purposes we choose the most convenient path. This "approximation" implies \|\boldsymbol{g}_t\|^2 \approx \|\boldsymbol{\mu}\|^2 + \|\boldsymbol{\sigma}\|^2, leading to: \begin{equation} \frac{\|\boldsymbol{m}_t\|}{\|\boldsymbol{g}_t\|} \approx \sqrt{\frac{\|\boldsymbol{\mu}\|^2 + \frac{1 - \beta_1}{1 + \beta_1}\|\boldsymbol{\sigma}\|^2}{\|\boldsymbol{\mu}\|^2 + \|\boldsymbol{\sigma}\|^2}} \end{equation} The right-hand side is identical in form to equation [eq:mean-field], so we can write: \begin{equation} \frac{\|\boldsymbol{\mu}\|^2}{\|\boldsymbol{\sigma}\|^2} \approx \frac{\|\boldsymbol{m}_t\|^2/\|\boldsymbol{g}_t\|^2 - \frac{1 - \beta_1}{1 + \beta_1}}{1 - \|\boldsymbol{m}_t\|^2/\|\boldsymbol{g}_t\|^2} \end{equation}

In other words, replacing \|\boldsymbol{u}_t\|_{RMS} with \|\boldsymbol{m}_t\|/\|\boldsymbol{g}_t\| gives a general way to estimate \|\boldsymbol{\mu}\|^2/\|\boldsymbol{\sigma}\|^2 for any optimizer with momentum. Some might ask what to do if there is no momentum. In that case, there is truly no way, because \|\boldsymbol{\mu}\|^2/\|\boldsymbol{\sigma}\|^2 is a statistic across the optimization trajectory; we must have some cross-trajectory statistical information to estimate it.
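As a quick check of this inverse prediction, one can simulate gradients with a known (arbitrarily chosen) mean and standard deviation, accumulate only the momentum, and compare the estimated SNR with the true one; the sketch below reuses the same pure-simulation setup as before:

import numpy as np

# Check the inverse formula: each gradient component ~ N(mu, sigma^2), mu and sigma known
beta1 = 0.9
mu, sigma = 0.2, 1.0
N, T = 100000, 2000
rng = np.random.default_rng(0)
m = 0.0
for _ in range(T):
    g = mu + sigma * rng.standard_normal(N)
    m = beta1 * m + (1 - beta1) * g          # momentum only, as in Muon

ratio = (m**2).sum() / (g**2).sum()          # ||m_t||^2 / ||g_t||^2
k = (1 - beta1) / (1 + beta1)
snr_est = (ratio - k) / (1 - ratio)          # estimated ||mu||^2 / ||sigma||^2
snr_true = mu**2 / sigma**2                  # true value, 0.04 here
print(snr_est, snr_true)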

Summary

This article explored Adam’s Update RMS from both simulation experiments and theoretical approximations. This serves as one of the theoretical foundations for aligning the Update RMS to 0.2 in the Muon optimizer.

Reprinting: Please include the original address of this article: https://kexue.fm/archives/11267

For more details on reprinting, please refer to: "Scientific Space FAQ"