I wonder if you have noticed an interesting detail: both Muon and MuP start with "Mu," but the original meanings of the two "Mu"s are completely different. The former stands for "MomentUm Orthogonalized by Newton-Schulz," while the latter stands for "Maximal Update Parametrization." Yet, there is indeed a very profound connection between them. In other words, Muon and MuP have completely different starting points, but they ultimately move in the same direction and even unintentionally adopted similar names, as if it were truly "meant to be."
Back to the main topic. In short, through various coincidences, I happened to learn about Muon and MuP at the same time. This has greatly deepened my understanding of model optimization and led me to think about more fundamental principles of optimization. After a period of trial and error, I have gained some modest insights, which I would like to share with you here.
Foreword
In terms of chronological order, MuP came before Muon. However, my learning order was exactly the opposite: I first studied Muon and then MuP. In hindsight, this turned out to be a good learning sequence.
In articles such as "Appreciation of the Muon Optimizer: A Fundamental Leap from Vectors to Matrices" and "Muon Sequel: Why We Choose to Try Muon?", we described Muon as "steepest descent under spectral norm constraints." The MuP series of work happens to explain "why spectral norm constraints are needed," perfectly bridging the two.
I should clarify that the "MuP" we refer to has two meanings. First, there is the "Elementary MuP" introduced in "A First Look at MuP: Cross-Model Scaling Laws for Hyperparameters", which is part of the Tensor Programs series. Second, there is the "High-order MuP" introduced in "High-order MuP: Simpler but Smarter Spectral Condition Scaling", which yields richer conclusions in a more concise manner. Both are the work of Greg Yang (salute to the master).
In this article, unless otherwise specified, MuP refers to "High-order MuP." In fact, this series of articles, which I call "Beyond MuP," consists of a series of reflections and extensions based on High-order MuP. That said, readers who only know the "Elementary MuP" from the Tensor Programs series might initially wonder how MuP can answer "why spectral norms are needed."
Regardless, I will try to make this series self-contained. Although I will mention many related papers or blogs during the introduction, readers do not need to read all of them in depth.
Seeking Speed within Stability
Back to the main topic again. As the first article, the task of this post is to set the core objective—specifically, to think clearly about "what kind of model we actually want" and "how we can train such a model."
Intuitively, as long as the model shows no signs of collapsing, we can keep training it until it converges to a satisfactory result. On this basis, we then try to find ways to make the model converge faster. So, to put it simply, it is just about two things: "stability" and "speed," or rather, "seeking speed within stability." So, how do we judge if a model is stable? This naturally involves monitoring various "internal health indicators"; the more you monitor, the more problems you can expose.
However, I do not intend to list various internal indicators here. Instead, I will try to find the most core or necessary conditions. To this end, let us first define a concept—RMS (Root Mean Square): Let \boldsymbol{x}=(x_1,x_2,\cdots,x_d)\in\mathbb{R}^d, then we define: \begin{equation} \Vert\boldsymbol{x}\Vert_{RMS} = \sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2} = \frac{\Vert\boldsymbol{x}\Vert_2}{\sqrt{d}} \end{equation} It represents the average scale of each element, differing from the vector norm \Vert\boldsymbol{x}\Vert_2 by a factor of \sqrt{d}.
Some readers might ask: since it’s just a factor difference, why not just observe the norm instead of defining a new concept? There are several considerations. For example, RMSNorm is frequently used, and RMS is easier to perceive than the norm. Furthermore, an important reason is that most activation functions are element-wise, so we need to examine and control the scale averaged to each element to ensure that activation functions play similar roles in different models.
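To make the definition concrete, here is a minimal numerical sketch of the RMS and its relation to the L2 norm (the dimension 4096 and the standard normal input are my own illustrative choices, not from the article):

```python
import numpy as np

def rms(x):
    """Root Mean Square: the L2 norm divided by sqrt(d)."""
    return np.sqrt(np.mean(np.square(x)))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)

# RMS and the L2 norm differ by exactly a factor of sqrt(d)
assert np.isclose(rms(x), np.linalg.norm(x) / np.sqrt(4096))

# For i.i.d. standard normal entries, RMS stays ~1 for any dimension,
# while the L2 norm grows like sqrt(d): RMS measures the per-element scale.
print(rms(x), np.linalg.norm(x))
```

This is why RMS is the more convenient quantity to monitor across model widths: it stays O(1) when each element is O(1), regardless of the dimension.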
Three Conditions
With the RMS notation, I can write down what I consider to be the three most necessary conditions for stably training a good model: \begin{align} &\text{Forward Stability:}\quad\max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \label{eq:c1}\\[5pt] &\text{Dependency Stability:}\quad\max_{\boldsymbol{x}_1,\boldsymbol{x}_2} \Vert \boldsymbol{f}(\boldsymbol{x}_1;\boldsymbol{\omega}) - \boldsymbol{f}(\boldsymbol{x}_2;\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \label{eq:c2}\\[5pt] &\text{Update Stability:}\quad\max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega} + \Delta\boldsymbol{\omega}) - \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \label{eq:c3} \end{align} where \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega}) represents a family of models mapping \mathbb{R}^{d_{in}}\mapsto \mathbb{R}^{d_{out}}, with input \boldsymbol{x}\in\mathbb{R}^{d_{in}}, output \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\in\mathbb{R}^{d_{out}}, and model parameters \boldsymbol{\omega} (which can be scalars, vectors, matrices, etc.). \mathcal{\Theta} is the "Big Theta notation." Here, \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega}) can be a single layer, a block composed of several layers, or even the entire model. Theoretically, the coarser the granularity, the looser, and hence the more accurate, the resulting constraints become, but the harder the \max is to solve; so the choice of granularity depends on our ability to actually compute the \max.
Among the three equations, Equation [eq:c1] is probably the easiest to understand. It represents the stability of the forward pass. After taking the \max over \boldsymbol{x}, the only remaining variable is \boldsymbol{\omega}, so this is a constraint on \boldsymbol{\omega}. Note that we have not restricted the values of \boldsymbol{x} here, so by default \boldsymbol{x}\in\mathbb{R}^{d_{in}}, meaning the maximum might not exist. For example, for a non-zero \boldsymbol{W}, \max\limits_{\boldsymbol{x}}\Vert \boldsymbol{x}\boldsymbol{W}\Vert_{RMS}\to\infty.
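This unboundedness is easy to verify numerically. A minimal sketch (the dimension and the 1/\sqrt{d}-scaled Gaussian weight are my own illustrative choices): since \boldsymbol{x}\boldsymbol{W} is linear in \boldsymbol{x}, its RMS scales linearly with the input scale, so the supremum over all of \mathbb{R}^{d_{in}} is infinite.

```python
import numpy as np

rng = np.random.default_rng(0)

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

d = 1024
W = rng.standard_normal((d, d)) / np.sqrt(d)  # any fixed non-zero matrix
x = rng.standard_normal(d)

# rms(c * x @ W) = c * rms(x @ W): the output RMS is linear in the
# input scale, so max over all x does not exist (it diverges).
for scale in [1.0, 1e3, 1e6]:
    print(scale, rms((scale * x) @ W))
```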
To ensure the existence of the maximum, we usually add some Normalization operations, such as: \begin{align} &\text{Pre Norm:}\quad \mathop{\text{RMSNorm}}(\boldsymbol{x})\boldsymbol{W} \\[5pt] &\text{Post Norm:}\quad \mathop{\text{RMSNorm}}(\boldsymbol{x}\boldsymbol{W}) \end{align} where \mathop{\text{RMSNorm}}(\boldsymbol{x})=\boldsymbol{x}/\Vert\boldsymbol{x}\Vert_{RMS}. Therefore, condition [eq:c1] also implicitly places some requirements on the model architecture. Similarly, Equation [eq:c2] requires that the model output genuinely depends on the input. For a simple example, consider f(x;\omega)=x\times\omega\times 0 + 1; the forward pass of this "model" is perfectly stable, but its output does not depend on x at all, so Equation [eq:c2] cannot be satisfied. Thus, it is not a good model.
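The effect of Pre Norm on condition [eq:c1] can be checked directly; a minimal sketch (dimensions and the 1/\sqrt{d} Gaussian weight are illustrative assumptions of mine, and the eps term is the usual numerical safeguard):

```python
import numpy as np

rng = np.random.default_rng(0)

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def rms_norm(x, eps=1e-8):
    """RMSNorm(x) = x / ||x||_RMS; the result has RMS ~ 1 by construction."""
    return x / (rms(x) + eps)

d = 1024
W = rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal(d)

# Pre Norm: RMSNorm(x) @ W. The input to W now has fixed RMS ~ 1,
# so the output RMS stays bounded no matter how large x is.
for scale in [1.0, 1e3, 1e6]:
    print(scale, rms(rms_norm(scale * x) @ W))  # essentially the same value
```

Compare with the unnormalized case, where rms((scale * x) @ W) grows linearly with the scale: the normalization is precisely what makes the \max in condition [eq:c1] finite.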
Finally, Equation [eq:c3] should also be easy to understand. After taking the \max over \boldsymbol{x}, the result is a constraint on \boldsymbol{\omega} and \Delta\boldsymbol{\omega}, but it primarily concerns the impact of the increment \Delta\boldsymbol{\omega}, expressing our expectation of training stability. We can use it to guide the optimizer's hyperparameter settings, or even to construct new optimizers.
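In practice, condition [eq:c3] can be probed empirically by measuring how much the output moves per parameter update. Here is a minimal sketch for a Pre Norm linear layer; the Gaussian \Delta\boldsymbol{W} scaled by a "learning rate" is my own stand-in for a real optimizer update, not anything from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def f(x, W):
    # Pre Norm linear layer: RMSNorm(x) @ W
    return (x / rms(x)) @ W

d = 1024
W = rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal(d)

# Probe condition [eq:c3]: RMS of the output change per update.
# For a linear layer the change is exactly RMSNorm(x) @ dW, so its RMS
# is controlled entirely by the size of the update dW.
for lr in [1e-1, 1e-2, 1e-3]:
    dW = lr * rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in update
    print(lr, rms(f(x, W + dW) - f(x, W)))  # shrinks linearly with lr
```

Monitoring this quantity during real training, and tuning the learning rate so that it stays \mathcal{\Theta}(1) at the desired scale, is one concrete way to act on the update-stability condition.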
Summary
Starting from this article, I will share some top-down understandings of model optimization, which are extensions and expansions based on the previous "High-order MuP." As the first article, we mainly described three basic conditions for model stability, or the three characteristics of a good model, which will serve as the foundation for subsequent calculations and analysis.
Original address: https://kexue.fm/archives/11340