By now, many readers have likely come across news about the Muon optimizer. Muon was proposed roughly a year ago, in October of last year, by Keller Jordan on Twitter. Within just this one year, it has already withstood the test of training models at scales of tens of billions, hundreds of billions, and even a trillion parameters, proving itself a highly competitive optimizer.
Today, Muon is built into training frameworks such as PyTorch and Keras, and even large-scale frameworks like Megatron have gradually begun to support it, which signals broad industry recognition. However, for readers who are only familiar with Adam, it may not be obvious how to switch to Muon quickly and effectively. This article therefore attempts to provide a quick-start tutorial.
Brief Introduction
The formal proposer of Muon is Keller Jordan, who currently works at OpenAI. As mentioned at the beginning, Muon was first published on Twitter. Even now, the author has only written a blog post, "Muon: An optimizer for hidden layers in neural networks", rather than a formal paper. The author’s view is that “whether it is written as a paper has nothing to do with whether the optimizer is effective” [1].
Muon is an optimizer specifically tailored for matrix parameters. There are other related works with similar characteristics, such as Shampoo and the earlier Stochastic Spectral Descent, among others. Many works relate to Muon to some extent, but none completely cover it. In my view, Muon stands as a brand-new contribution.
In China, the earliest article popularizing Muon was likely my blog post “Appreciating the Muon Optimizer: An Essential Leap from Vectors to Matrices”. The first validation of Muon on a relatively large scale was likely our February release of Moonlight. The “Moonlight version” of Muon proposed there was later used in the trillion-parameter K2 model. Following K2, GLM-4.5 also utilized this Muon variant.
As Jeremy Bernstein, one of Muon’s authors, stated in his blog “Deriving Muon”, the uniqueness of Muon lies in the fact that it can be derived from more fundamental optimization principles and is effective in practice. In contrast, while Adam is also very effective, it feels more like a heuristic solution.
Four Versions
I do not intend to introduce the mathematical details or the implementation of Muon here. Instead, I will focus on the technical details and precautions for switching from Adam to Muon. As mentioned, Muon is dedicated to matrix parameter optimization and uses a non-element-wise update rule, which can be confusing for new users.
Furthermore, to my knowledge, there are at least four slightly different versions of Muon currently in existence. This multi-version phenomenon exacerbates the confusion. If users do not understand the details, they might achieve poor results due to incorrect hyperparameter tuning (especially the learning rate). Below, I will clarify these versions. For a matrix \boldsymbol{W} \in \mathbb{R}^{d_{in} \times d_{out}} with gradient \boldsymbol{G}, the four Muon variants are:
\begin{equation*} \begin{aligned} &\quad\boldsymbol{M}_t \quad=\quad \beta \boldsymbol{M}_{t-1} + \boldsymbol{G}_t \\[10pt] &\quad\boldsymbol{W}_t = \begin{cases} \boldsymbol{W}_{t-1} - \eta_t \left(\mathop{\mathrm{msign}}(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) & \text{\color{cyan}(Naive version)} \\[5pt] \boldsymbol{W}_{t-1} - \eta_t \left(\sqrt{\max(1, d_{out}/d_{in})}\mathop{\mathrm{msign}}(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) & \text{\color{cyan}(Keller Jordan version)} \\[5pt] \boldsymbol{W}_{t-1} - \eta_t \left(\sqrt{ d_{out}/d_{in}}\mathop{\mathrm{msign}}(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) & \text{\color{cyan}(MuP version)} \\[5pt] \boldsymbol{W}_{t-1} - \eta_t \left(0.2\times\sqrt{\max(d_{out},d_{in})}\mathop{\mathrm{msign}}(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) & \text{\color{cyan}(Moonlight version)} \end{cases} \end{aligned} \end{equation*}
If Nesterov momentum is enabled, \mathop{\mathrm{msign}}(\boldsymbol{M}_t) is replaced by \mathop{\mathrm{msign}}(\beta\boldsymbol{M}_t + \boldsymbol{G}_t). In implementations, \mathop{\mathrm{msign}} is usually named zeropower_via_newtonschulz; general users do not need to worry about its specific implementation details.
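For the curious, \mathop{\mathrm{msign}}(\boldsymbol{M}) is the orthogonalization \boldsymbol{U}\boldsymbol{V}^{\top} obtained from the reduced SVD \boldsymbol{M} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}, and in practice it is approximated by a few Newton-Schulz iterations. Below is a minimal PyTorch sketch; the quintic coefficients follow Keller Jordan's reference implementation, but constants and iteration counts vary across implementations, so treat it as illustrative rather than canonical.

```python
import torch

def msign_newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate msign(G) = U V^T (from the reduced SVD of G) via Newton-Schulz.

    A minimal sketch; the coefficients follow Keller Jordan's reference
    zeropower_via_newtonschulz, but other implementations may differ.
    """
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic iteration coefficients
    X = G / (G.norm() + eps)              # scale so all singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X
```

The exact answer would be U @ V.T from torch.linalg.svd(G, full_matrices=False); the iteration above is simply a cheaper, GPU-friendly approximation of it.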
The only difference between the four versions is the scaling factor in front of \mathop{\mathrm{msign}}. The “Keller Jordan version” and the “MuP version” are very similar, while the “Moonlight version” is slightly more distinctive. Keras implements only the “Keller Jordan version,” while PyTorch implements both the “Keller Jordan version” and the “Moonlight version.” The Naive version seems relatively rare nowadays. Personally, I usually use my own implementation of the “MuP version.”
Two Dimensions
An important detail to note is that the “Keller Jordan version” and the “MuP version” are sensitive to the order of d_{in} and d_{out}. Therefore, the first task is to clarify the meaning of d_{in} and d_{out}. It is not necessarily the case that the first dimension of a matrix is d_{in} and the second is d_{out}.
d_{in} and d_{out} represent the input and output dimensions of a linear layer, respectively. Which is which depends on the specific implementation of the linear layer. For example, Keras’s Dense layer is implemented as \boldsymbol{x}\boldsymbol{W}, so the first dimension of matrix \boldsymbol{W} is d_{in} and the second is d_{out}. However, PyTorch’s Linear layer implements \boldsymbol{x}\boldsymbol{W}^{\top}, so the second dimension of matrix \boldsymbol{W} is d_{in} and the first dimension is d_{out}.
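A quick way to avoid guessing is simply to check the shapes. A small sanity check (assuming standard PyTorch; the corresponding Keras 3 fact is stated in a comment):

```python
import torch

d_in, d_out = 512, 1024

# PyTorch: Linear computes x @ W^T, so the stored weight has shape (d_out, d_in).
linear = torch.nn.Linear(d_in, d_out, bias=False)
print(linear.weight.shape)   # torch.Size([1024, 512]) -> (d_out, d_in)

# Keras 3 (not imported here): Dense computes x @ W, so its kernel has shape
# (d_in, d_out), e.g. keras.layers.Dense(1024) built on width-512 inputs
# stores a (512, 1024) kernel.
```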
Therefore, if you want to implement the “Keller Jordan version” of Muon for a PyTorch Linear layer, the scaling factor should be max(1, W.shape[0]/W.shape[1])**0.5. For Keras, it should be max(1, W.shape[1]/W.shape[0])**0.5. Consequently, the current Keras (v3.12) implementation of Muon is actually incorrect, because it copied the scaling factor implementation directly from PyTorch [2].
If you are writing your own model, you must judge carefully based on your implementation. For instance, it is possible to mix built-in Linear layers with manual x @ W operations in PyTorch, in which case you cannot generalize whether to use W.shape[0]/W.shape[1] or W.shape[1]/W.shape[0]. Of course, if you find this too troublesome, you can consider using the “Moonlight version,” as its scaling factor is symmetric with respect to d_{in} and d_{out}.
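To make this concrete, here is a small helper of my own (not any framework's API) that computes the “Keller Jordan” and “Moonlight” factors directly from the weight shape; only the former needs to know which axis is d_{in}:

```python
import math

def kj_scale(d_in: int, d_out: int) -> float:
    """'Keller Jordan version' factor: sqrt(max(1, d_out / d_in))."""
    return max(1.0, d_out / d_in) ** 0.5

def moonlight_scale(shape) -> float:
    """'Moonlight version' factor: 0.2 * sqrt(max(dims)); orientation-agnostic."""
    return 0.2 * math.sqrt(max(shape))

W_torch = (1024, 512)   # PyTorch Linear weight: (d_out, d_in)
W_keras = (512, 1024)   # Keras Dense kernel:    (d_in, d_out)

print(kj_scale(d_in=W_torch[1], d_out=W_torch[0]))   # sqrt(2) ~ 1.414
print(kj_scale(d_in=W_keras[0], d_out=W_keras[1]))   # same layer, same factor
print(moonlight_scale(W_torch), moonlight_scale(W_keras))   # both 0.2*sqrt(1024) = 6.4
```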
Hyperparameter Settings
After clarifying d_{in} and d_{out}, the remaining task is setting the learning rate \eta_t and the weight decay coefficient \lambda. The assumption here is that the user already has experience tuning Adam and has achieved good results, and now wants to quickly migrate to Muon.
Let’s look at the “Moonlight version” first. Its scaling factor was obtained by aligning with Adam’s Update RMS. For more details, refer to “Muon Sequel: Why Did We Choose to Try Muon?”. Regarding the “Magic Number” 0.2, you can refer to “Why is Adam’s Update RMS 0.2?”. Simply put, the “Moonlight version” of Muon aligns with Adam’s update magnitude, so the simplest way to migrate from Adam is: change nothing—just use Adam’s \eta_t and \lambda.
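As a quick back-of-the-envelope check of why this works (my own sketch of the argument, assuming \boldsymbol{M}_t has full rank \min(d_{in}, d_{out})): \mathop{\mathrm{msign}}(\boldsymbol{M}_t) = \boldsymbol{U}\boldsymbol{V}^{\top} is semi-orthogonal, so its squared Frobenius norm is \min(d_{in}, d_{out}) and
\begin{equation*} \mathop{\mathrm{RMS}}\left(0.2\sqrt{\max(d_{in},d_{out})}\,\mathop{\mathrm{msign}}(\boldsymbol{M}_t)\right) = 0.2\sqrt{\max(d_{in},d_{out})}\sqrt{\frac{\min(d_{in},d_{out})}{d_{in} d_{out}}} = 0.2, \end{equation*}
which is exactly the typical Update RMS of Adam discussed in the linked post.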
Now for the other three versions. We know that mainstream models usually have a hidden_size (denoted as d), and most matrix shapes will not deviate significantly from d \times d. Therefore, we approximate d_{in} = d_{out} = d. In this case, these three versions are identical, but they lack the 0.2\sqrt{d} factor compared to the “Moonlight version.” Since the “Moonlight version” aligns with Adam’s update magnitude without changing hyperparameters, the learning rates for these three versions should be scaled up by 0.2\sqrt{d} to align with Adam, and \lambda should be divided by 0.2\sqrt{d}.
Substituting d = 1024, 2048, 4096 gives factors of approximately 6.4, 9, and 12.8. If you cannot remember 0.2\sqrt{d}, just remember that when using the other three versions of Muon, you should **multiply Adam’s learning rate by roughly 10** to get the Muon learning rate. If you directly apply Adam’s learning rate to Muon, you will likely conclude that Muon is far inferior to Adam due to underfitting; to my knowledge, some of the negative feedback on Muon stems from exactly this.
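If you prefer not to do this conversion by hand, a few lines of Python (my own sketch; it simply applies the 0.2\sqrt{d} rule above, with d taken as the model's hidden_size) will do it:

```python
import math

def adam_to_muon(adam_lr: float, adam_wd: float, hidden_size: int, version: str = "mup"):
    """Convert tuned Adam hyperparameters into a starting point for Muon.

    The 'moonlight' version already matches Adam's update RMS, so nothing changes;
    the other versions need the 0.2 * sqrt(d) correction discussed above.
    """
    if version == "moonlight":
        factor = 1.0
    else:  # 'naive', 'keller_jordan', 'mup', treating d_in ~ d_out ~ hidden_size
        factor = 0.2 * math.sqrt(hidden_size)
    return adam_lr * factor, adam_wd / factor

print(adam_to_muon(3e-4, 0.1, hidden_size=4096))   # lr ~ 3.84e-3, weight decay ~ 7.8e-3
```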
Does this mean the “Moonlight version” is better? It certainly has good practical effects, but saying it is “better” is evaluating it from an Adam-centric perspective. The advantage of the “MuP version” or “Keller Jordan version” is learning rate transferability; that is, a learning rate tuned on a small model often works well when directly applied to a large model. For this, refer to Jeremy Bernstein’s blog “Deriving Muon” or my blog “Higher-order MuP: Simpler but Smarter Spectral Condition Scaling”.
Other Parameters
If Muon only handles matrix parameters, what about the rest? For example, the bias terms in linear layers and the gamma vectors in RMSNorm are 1D parameters, and convolutional layers may have 3D or 4D parameter arrays.
First, a clarification: Muon does not handle arbitrary matrix parameters; it specifically targets **matrix parameters of linear layers with dense inputs**. If this sounds confusing, just remember that the matrix parameters of Embedding layers and of the final classification layer (including GPT’s LM Head) should not use Muon, otherwise performance will degrade significantly. For these parameters, as well as for 1D, 3D, and higher-dimensional parameters, if you do not want to overthink it, just use Adam. Most Muon implementations are hybrids with Adam that let users route specific layers to Adam.
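In practice this usually amounts to splitting the parameters into two groups. The sketch below assumes a generic GPT-style PyTorch model; the name-based exclusion is only a heuristic that depends on how your parameters are named, and Muon(...) is a placeholder for whichever implementation you actually use:

```python
import torch

def split_params_for_muon(model: torch.nn.Module):
    """Route dense hidden-layer matrices to Muon and everything else to AdamW."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_matrix = p.ndim == 2
        # Embeddings and the final classification layer (LM Head) stay on AdamW.
        is_excluded = any(k in name.lower() for k in ("embed", "lm_head", "classifier"))
        (muon_params if is_matrix and not is_excluded else adamw_params).append(p)
    return muon_params, adamw_params

# muon_params, adamw_params = split_params_for_muon(model)
# opt_muon  = Muon(muon_params, lr=muon_lr, weight_decay=muon_wd)   # placeholder API
# opt_adamw = torch.optim.AdamW(adamw_params, lr=adam_lr, weight_decay=adam_wd)
```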
If you are willing to experiment, 3D and 4D parameters such as those in convolutional layers can also use Muon. Taking Conv2D as an example, the kernel shape is usually (w, h, d_{in}, d_{out}). Its equivalent implementation flattens the (w, h, d_{in}) patch input into a vector of size w \times h \times d_{in} and reshapes the kernel to (w \times h \times d_{in}, d_{out}) for matrix multiplication. To use Muon here, you would reshape the momentum to (w \times h \times d_{in}, d_{out}), calculate \mathop{\mathrm{msign}}, and then reshape it back for the update.
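A minimal sketch of that recipe, assuming the (w, h, d_{in}, d_{out}) kernel layout described above (PyTorch’s Conv2d actually stores (d_{out}, d_{in}, h, w), so there you would permute the axes first):

```python
import torch

def conv_msign_update(momentum: torch.Tensor, msign) -> torch.Tensor:
    """msign for a conv kernel with layout (w, h, d_in, d_out).

    `msign` is any matrix-sign routine, e.g. the msign_newton_schulz sketch above
    or your implementation's zeropower_via_newtonschulz.
    """
    w, h, d_in, d_out = momentum.shape
    flat = momentum.reshape(w * h * d_in, d_out)    # fold the patch axes into d_in
    return msign(flat).reshape(w, h, d_in, d_out)   # restore the kernel layout
```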
Similarly, the gamma parameter of RMSNorm can be viewed as multiplication by a diagonal matrix. By treating its momentum as a diagonal matrix, one can calculate \mathop{\mathrm{msign}}, which is equivalent to SignSGDM. The Embedding layer can be viewed as multiple (1, d) matrices for calculating \mathop{\mathrm{msign}}, resulting in Normalized SGDM (refer to “Appreciating the Muon Optimizer: An Essential Leap from Vectors to Matrices”). If you want to go further, you could consider whether each head’s projection matrix in Multi-Head Attention should be handled separately for \mathop{\mathrm{msign}}...
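For these degenerate shapes, msign has a closed form and no Newton-Schulz iteration is needed; a brief sketch of the two special cases just mentioned:

```python
import torch

def msign_diagonal(momentum_1d: torch.Tensor) -> torch.Tensor:
    """RMSNorm gamma viewed as a diagonal matrix: msign(diag(m)) = diag(sign(m)),
    so the update reduces to SignSGD with momentum."""
    return torch.sign(momentum_1d)

def msign_rows(momentum_2d: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Embedding table viewed as a stack of (1, d) matrices: msign of each row is
    the row divided by its L2 norm, i.e. Normalized SGD with momentum."""
    return momentum_2d / (momentum_2d.norm(dim=-1, keepdim=True) + eps)
```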
Life is short; keep on tinkering!
Expected Results
Finally, if the user has correctly set up and run the optimizer according to the above instructions, they can begin to pray for the favor of the goddess of luck.
What kind of results should we expect? If no anomalies such as gradient explosions occur, Muon will in most cases be slightly better than Adam. Of course, Muon may also be slightly worse in some cases, but either way the gap should not be large. If one side significantly outperforms the other, you should check whether something is wrong with the settings on either side.
However, nothing is absolute. For instance, under certain extreme settings, Muon can indeed perform much better than Adam, where Adam fails no matter how it is tuned. In short, good luck. If you encounter interesting phenomena, feel free to exchange and analyze them together.
When reposting, please include the original address of this article: https://kexue.fm/archives/11416
For more detailed reposting matters, please refer to: Scientific Space FAQ