I wonder if you have noticed an interesting detail: both Muon and MuP start with "Mu," but the original meanings of the two "Mu"s are completely different. The former stands for "MomentUm Orthogonalized by Newton-Schulz," while the latter stands for "Maximal Update Parametrization." Yet, there is indeed a very profound connection between them. In other words, Muon and MuP have completely different starting points, but they ultimately move in the same direction and even unintentionally adopted similar names, as if it were truly "meant to be."
Back to the main topic. In short, through various coincidences, I happened to learn about Muon and MuP at the same time. This has greatly deepened my understanding of model optimization and led me to think about more fundamental principles of optimization. After a period of trial and error, I have gained some modest insights, which I would like to share with you here.
Foreword
In terms of chronological order, MuP came before Muon. However, my learning order was exactly the opposite: I first studied Muon and then MuP. In hindsight, this turned out to be a good learning sequence.
In articles such as "Appreciation of the Muon Optimizer: A Fundamental Leap from Vectors to Matrices" and "Muon Sequel: Why We Choose to Try Muon?", we described Muon as "steepest descent under spectral norm constraints." The MuP series of work happens to explain "why spectral norm constraints are needed," perfectly bridging the two.
I should clarify that the "MuP" we refer to has two meanings. First, there is the "Elementary MuP" introduced in "A First Look at MuP: Cross-Model Scaling Laws for Hyperparameters", which is part of the Tensor Programs series. Second, there is the "High-order MuP" introduced in "High-order MuP: Simpler but Smarter Spectral Condition Scaling", which yields richer conclusions in a more concise manner. Both are the work of Greg Yang (salute to the master).
In this article, unless otherwise specified, MuP refers to "High-order MuP." In fact, this series of articles, which I call "Beyond MuP," consists of a series of reflections and extensions based on High-order MuP. That said, readers who only know the "Elementary MuP" from the Tensor Programs series might initially wonder how MuP can answer "why spectral norms are needed."
Regardless, I will try to make this series self-contained. Although I will mention many related papers or blogs during the introduction, readers do not need to read all of them in depth.
Seeking Speed within Stability
Back to the main topic again. As the first article, the task of this post is to set the core objective—specifically, to think clearly about "what kind of model we actually want" and "how we can train such a model."
Intuitively, as long as the model shows no signs of collapsing, we can keep training it until it converges to a satisfactory result. On this basis, we then try to find ways to make the model converge faster. So, to put it simply, it is just about two things: "stability" and "speed," or rather, "seeking speed within stability." So, how do we judge if a model is stable? This naturally involves monitoring various "internal health indicators"; the more you monitor, the more problems you can expose.
However, I do not intend to list various internal indicators here. Instead, I will try to find the most core or necessary conditions. To this end, let us first define a concept—RMS (Root Mean Square): Let \boldsymbol{x}=(x_1,x_2,\cdots,x_d)\in\mathbb{R}^d, then we define: \begin{equation} \Vert\boldsymbol{x}\Vert_{RMS} = \sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2} = \frac{\Vert\boldsymbol{x}\Vert_2}{\sqrt{d}} \end{equation} It represents the average scale of each element, differing from the vector norm \Vert\boldsymbol{x}\Vert_2 by a factor of \sqrt{d}.
Some readers might ask: since it’s just a factor difference, why not just observe the norm instead of defining a new concept? There are several considerations. For example, RMSNorm is frequently used, and RMS is easier to perceive than the norm. Furthermore, an important reason is that most activation functions are element-wise, so we need to examine and control the scale averaged to each element to ensure that activation functions play similar roles in different models.
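To make the definition concrete, here is a minimal numerical sketch of the RMS and its relation to the L2 norm (the dimension 4096 and the standard normal input are my own illustrative choices, not from the article):

```python
import numpy as np

def rms(x):
    """Root Mean Square: the L2 norm divided by sqrt(d)."""
    return np.sqrt(np.mean(np.square(x)))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)

# RMS and the L2 norm differ by exactly a factor of sqrt(d)
assert np.isclose(rms(x), np.linalg.norm(x) / np.sqrt(4096))

# For i.i.d. standard normal entries, RMS stays ~1 for any dimension,
# while the L2 norm grows like sqrt(d): RMS measures the per-element scale.
print(rms(x), np.linalg.norm(x))
```

This is why RMS is the more convenient quantity to monitor across model widths: it stays O(1) when each element is O(1), regardless of the dimension.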
Three Conditions
With the RMS notation, I can write down what I consider to be the three most necessary conditions for stably training a good model: \begin{align} &\text{Forward Stability:}\quad\max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \label{eq:c1}\\[5pt] &\text{Dependency Stability:}\quad\max_{\boldsymbol{x}_1,\boldsymbol{x}_2} \Vert \boldsymbol{f}(\boldsymbol{x}_1;\boldsymbol{\omega}) - \boldsymbol{f}(\boldsymbol{x}_2;\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \label{eq:c2}\\[5pt] &\text{Update Stability:}\quad\max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega} + \Delta\boldsymbol{\omega}) - \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \label{eq:c3} \end{align} where \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega}) represents a family of models mapping \mathbb{R}^{d_{in}}\mapsto \mathbb{R}^{d_{out}}, with input \boldsymbol{x}\in\mathbb{R}^{d_{in}}, output \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\in\mathbb{R}^{d_{out}}, and model parameters \boldsymbol{\omega} (which can be scalars, vectors, matrices, etc.). \mathcal{\Theta} is the "Big Theta notation." Here, \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega}) can be a single layer, a block composed of several layers, or even the entire model. Theoretically, the coarser the granularity, the looser, and hence the more accurate, the resulting constraints become, but the harder the \max is to solve; so the choice of granularity depends on our ability to actually compute the \max.
Among the three equations, Equation [eq:c1] is probably the easiest to understand. It represents the stability of the forward pass. After taking the \max over \boldsymbol{x}, the only remaining variable is \boldsymbol{\omega}, so this is a constraint on \boldsymbol{\omega}. Note that we have not restricted the values of \boldsymbol{x} here, so by default \boldsymbol{x}\in\mathbb{R}^{d_{in}}, meaning the maximum might not exist. For example, for a non-zero \boldsymbol{W}, \max\limits_{\boldsymbol{x}}\Vert \boldsymbol{x}\boldsymbol{W}\Vert_{RMS}\to\infty.
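This unboundedness is easy to verify numerically. A minimal sketch (the dimension and the 1/\sqrt{d}-scaled Gaussian weight are my own illustrative choices): since \boldsymbol{x}\boldsymbol{W} is linear in \boldsymbol{x}, its RMS scales linearly with the input scale, so the supremum over all of \mathbb{R}^{d_{in}} is infinite.

```python
import numpy as np

rng = np.random.default_rng(0)

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

d = 1024
W = rng.standard_normal((d, d)) / np.sqrt(d)  # any fixed non-zero matrix
x = rng.standard_normal(d)

# rms(c * x @ W) = c * rms(x @ W): the output RMS is linear in the
# input scale, so max over all x does not exist (it diverges).
for scale in [1.0, 1e3, 1e6]:
    print(scale, rms((scale * x) @ W))
```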
To ensure the existence of the maximum, we usually add some Normalization operations, such as: \begin{align} &\text{Pre Norm:}\quad \mathop{\text{RMSNorm}}(\boldsymbol{x})\boldsymbol{W} \\[5pt] &\text{Post Norm:}\quad \mathop{\text{RMSNorm}}(\boldsymbol{x}\boldsymbol{W}) \end{align} where \mathop{\text{RMSNorm}}(\boldsymbol{x})=\boldsymbol{x}/\Vert\boldsymbol{x}\Vert_{RMS}. Therefore, condition [eq:c1] also implicitly places some requirements on the model architecture. Similarly, Equation [eq:c2] requires that the model output genuinely depends on the input. For a simple example, consider f(x;\omega)=x\times\omega\times 0 + 1; the forward pass of this "model" is perfectly stable, but its output does not depend on x at all, so Equation [eq:c2] cannot be satisfied. Thus, it is not a good model.
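The effect of Pre Norm on condition [eq:c1] can be checked directly; a minimal sketch (dimensions and the 1/\sqrt{d} Gaussian weight are illustrative assumptions of mine, and the eps term is the usual numerical safeguard):

```python
import numpy as np

rng = np.random.default_rng(0)

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def rms_norm(x, eps=1e-8):
    """RMSNorm(x) = x / ||x||_RMS; the result has RMS ~ 1 by construction."""
    return x / (rms(x) + eps)

d = 1024
W = rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal(d)

# Pre Norm: RMSNorm(x) @ W. The input to W now has fixed RMS ~ 1,
# so the output RMS stays bounded no matter how large x is.
for scale in [1.0, 1e3, 1e6]:
    print(scale, rms(rms_norm(scale * x) @ W))  # essentially the same value
```

Compare with the unnormalized case, where rms((scale * x) @ W) grows linearly with the scale: the normalization is precisely what makes the \max in condition [eq:c1] finite.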
Finally, Equation [eq:c3] should also be easy to understand. After taking the \max over \boldsymbol{x}, the result is a constraint on \boldsymbol{\omega} and \Delta\boldsymbol{\omega}, but it primarily concerns the impact of the increment \Delta\boldsymbol{\omega}, expressing our expectation of training stability. We can use it to guide the optimizer's hyperparameter settings, or even to construct new optimizers.
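In practice, condition [eq:c3] can be probed empirically by measuring how much the output moves per parameter update. Here is a minimal sketch for a Pre Norm linear layer; the Gaussian \Delta\boldsymbol{W} scaled by a "learning rate" is my own stand-in for a real optimizer update, not anything from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def f(x, W):
    # Pre Norm linear layer: RMSNorm(x) @ W
    return (x / rms(x)) @ W

d = 1024
W = rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal(d)

# Probe condition [eq:c3]: RMS of the output change per update.
# For a linear layer the change is exactly RMSNorm(x) @ dW, so its RMS
# is controlled entirely by the size of the update dW.
for lr in [1e-1, 1e-2, 1e-3]:
    dW = lr * rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in update
    print(lr, rms(f(x, W + dW) - f(x, W)))  # shrinks linearly with lr
```

Monitoring this quantity during real training, and tuning the learning rate so that it stays \mathcal{\Theta}(1) at the desired scale, is one concrete way to act on the update-stability condition.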
Summary
Starting from this article, I will share some top-down understandings of model optimization, which are extensions and expansions based on the previous "High-order MuP." As the first article, we mainly described three basic conditions for model stability, or the three characteristics of a good model, which will serve as the foundation for subsequent calculations and analysis.
Original address: https://kexue.fm/archives/11340