Muon Implementation Based on Streaming Power Iteration: 4. Principles · English (unofficial) translations of posts at kexue.fm

After the three articles “Muon Implementation Based on Streaming Power Iteration: 1. First Look", “Muon Implementation Based on Streaming Power Iteration: 2. Acceleration", and “Muon Implementation Based on Streaming Power Iteration: 3. Refinement", we believe everyone has gained some understanding of the ideas, implementation, and acceleration details of Streaming Power Iteration. Overall, it can be considered a quite competitive way to implement Muon, and thanks to its direct approximation of SVD, it also possesses better scalability.

Due to space limitations, we previously described the mathematical principles of the relevant operations relatively briefly. Therefore, in this article, we supplement some mathematical derivations regarding power iteration and QR decomposition to build a more complete theoretical picture. However, the derivations here still emphasize explanation rather than rigor, mainly to help everyone (including the author) clarify their thoughts. Professional readers are kindly asked to bear with us.

Coaxial Equivalence

Before starting the derivation, we need to introduce the concept of “Coaxial Equivalence”. For matrices \boldsymbol{A},\boldsymbol{B}\in\mathbb{R}^{n\times m}, if there exists a signature matrix \boldsymbol{S} such that \boldsymbol{A} = \boldsymbol{B}\boldsymbol{S}, then we say \boldsymbol{A} and \boldsymbol{B} are “Coaxial Equivalent”, and they are each other’s “coaxial matrix”. The “Signature Matrix” here refers to a diagonal matrix with \pm 1 on the diagonal, i.e., \mathop{\mathrm{diag}}(\pm 1, \pm 1, \cdots, \pm 1).

It should be clarified that the term “Coaxial” is proposed by the author to describe this equivalence relationship. From the perspective of coordinate systems, matrices \boldsymbol{A} and \boldsymbol{B} satisfying the condition essentially describe the same coordinate system, only with different choices of the positive direction for certain axes. Some literature seems to call them “Sign Equivalent”, but the author feels that “Coaxial” is more intuitive.

The reason for introducing the concept of coaxial equivalence is that many matrix decompositions are unique only in the sense of coaxial equivalence (which also makes the author wonder why such a commonly used equivalence relationship does not have a standard name). For example, for QR decomposition, let matrix \boldsymbol{A}\in\mathbb{R}^{n\times m}(n\geq m) be full-rank, and let \boldsymbol{Q}_1\boldsymbol{R}_1 and \boldsymbol{Q}_2\boldsymbol{R}_2 be two of its QR decompositions; then \boldsymbol{Q}_1 and \boldsymbol{Q}_2 are coaxial, and \boldsymbol{R}_1 and \boldsymbol{R}_2 are coaxial.

This uniqueness is easy to understand, because in the Gram-Schmidt orthogonalization process, we can arbitrarily flip the vectors after orthogonalization without changing the orthogonality. Common textbooks usually restrict the diagonal elements of \boldsymbol{R} to be positive to express the uniqueness of QR decomposition; this is also a way, but sometimes it brings unnecessary trouble. Additionally, the uniqueness of SVD also holds under coaxial equivalence.

Power Iteration

In “Muon Implementation Based on Streaming Power Iteration: 1. First Look", we introduced power iteration by solving eigenvectors one by one, and then directly switched to the parallel version, assuming it has the same convergence results. Here, we start directly from the parallel version to prove its convergence.

Still let matrix \boldsymbol{A}\in\mathbb{R}^{n\times m}(n\geq m) and assume it is full-rank (rank m). Consider the power iteration \boldsymbol{V}_t = \mathop{\mathrm{QR}}(\boldsymbol{A}^{\top}\boldsymbol{A} \boldsymbol{V}_{t-1}),\qquad \boldsymbol{V}_0 = \boldsymbol{I} Let the QR decomposition of \boldsymbol{A}^{\top}\boldsymbol{A} \boldsymbol{V}_{t-1} be \boldsymbol{Q}_t \boldsymbol{R}_t, then \mathop{\mathrm{QR}}(\boldsymbol{A}^{\top}\boldsymbol{A} \boldsymbol{V}_{t-1}) = \boldsymbol{A}^{\top}\boldsymbol{A} \boldsymbol{V}_{t-1}\boldsymbol{R}_t^{-1}. By iterating step by step we get \boldsymbol{V}_t = (\boldsymbol{A}^{\top}\boldsymbol{A})^t \boldsymbol{R}_1^{-1}\boldsymbol{R}_2^{-1} \cdots \boldsymbol{R}_t^{-1} Then assume the SVD of \boldsymbol{A} is \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}, so \boldsymbol{A}^{\top}\boldsymbol{A}=\boldsymbol{V}\boldsymbol{\Sigma}^2\boldsymbol{V}^{\top}. Substitute into the above formula to get \boldsymbol{V}_t = \boldsymbol{V}\boldsymbol{\Sigma}^{2t}\boldsymbol{V}^{\top} \boldsymbol{R}_1^{-1}\boldsymbol{R}_2^{-1} \cdots \boldsymbol{R}_t^{-1} = \boldsymbol{V}\underbrace{(\boldsymbol{\Sigma}^{2t}\boldsymbol{V}^{\top}\boldsymbol{\Sigma}^{-2t})}_{\boldsymbol{B}}\,\underbrace{(\boldsymbol{\Sigma}^{2t} \boldsymbol{R}_1^{-1}\boldsymbol{R}_2^{-1} \cdots \boldsymbol{R}_t^{-1})}_{\boldsymbol{C}} The result consists of three parts: \boldsymbol{V}, \boldsymbol{B}, \boldsymbol{C}. We want to prove that in the sense of coaxial equivalence, \boldsymbol{V}_t\to \boldsymbol{V}. We know that \boldsymbol{V}_t and \boldsymbol{V} are orthogonal matrices, and from the closure property of upper triangular matrices, \boldsymbol{C} is an upper triangular matrix. If we can prove that \boldsymbol{B} tends to an upper triangular matrix, then by the uniqueness of QR decomposition we can obtain \boldsymbol{V}_t\to \boldsymbol{V}. Regarding \boldsymbol{B}=\boldsymbol{\Sigma}^{2t}\boldsymbol{V}^{\top}\boldsymbol{\Sigma}^{-2t}, written in component form it is B_{i,j} = V_{j,i} (\sigma_i / \sigma_j)^{2t} Assume \sigma_1 > \sigma_2 > \cdots > \sigma_m. Then when i > j, (\sigma_i / \sigma_j)^{2t}\to 0, meaning the lower triangular part of \boldsymbol{B} tends to 0. In other words, \boldsymbol{B} tends to an upper triangular matrix, thus the condition is proven. From this we also see that the convergence speed of power iteration depends on the ratio of adjacent singular values: the larger \sigma_i / \sigma_{i+1} is, the faster the convergence. If equal singular values exist, power iteration will fail, but in practice we can assume the probability of two singular values being exactly equal is 0, thereby avoiding this issue.

Essential Thoughts

Let us carefully think about the entire proof process and understand what it essentially does.

First, the block \boldsymbol{C} = \boldsymbol{\Sigma}^{2t} \boldsymbol{R}_1^{-1}\boldsymbol{R}_2^{-1} \cdots \boldsymbol{R}_t^{-1} looks complicated, but its only role is merely “an upper triangular matrix”; that is, what it equals is not important at all, as long as it maintains the form of an upper triangular matrix. The core role is actually played by \boldsymbol{B}=\boldsymbol{\Sigma}^{2t}\boldsymbol{V}^{\top}\boldsymbol{\Sigma}^{-2t} tending to an upper triangular matrix, which allows the preceding \boldsymbol{V} to be extracted by the QR decomposition.

What truly achieves this effect is actually the step (\boldsymbol{A}^{\top}\boldsymbol{A})^t! In other words, theoretically, \boldsymbol{V}_t = \mathop{\mathrm{QR}}((\boldsymbol{A}^{\top}\boldsymbol{A})^t), meaning that all the previous QR operations are theoretically redundant; one only needs to do a single QR or orthogonalization at the end. Of course, this equivalence only has theoretical value, since directly computing (\boldsymbol{A}^{\top}\boldsymbol{A})^t would lead to numerical explosion/collapse, making the result unusable. Therefore, performing QR periodically (rather than just once at the end) essentially serves to maintain numerical stability.

We can also understand the multiple QR operations from the perspective of “preconditioner”. A preconditioner refers to some preprocessing operations on the input; these operations theoretically do not change the output, but are usually beneficial for numerical computation. For QR, right-multiplying by an arbitrary full-rank upper triangular matrix \boldsymbol{R} does not change the result, i.e., \mathop{\mathrm{QR}}(\boldsymbol{A}) = \mathop{\mathrm{QR}}(\boldsymbol{A}\boldsymbol{R}). Therefore, any right-multiplied full-rank upper triangular matrix can be called a preconditioner for QR.

We know that QR itself can be written as right-multiplication by an upper triangular matrix, so QR itself is a preconditioner for QR. Therefore, we can insert some QR operations into (\boldsymbol{A}^{\top}\boldsymbol{A})^t, turning it into (\boldsymbol{A}^{\top}\boldsymbol{A})^t\qquad\to\qquad \boldsymbol{A}^{\top}\boldsymbol{A}\mathop{\mathrm{QR}}(\boldsymbol{A}^{\top}\boldsymbol{A}\mathop{\mathrm{QR}}(\boldsymbol{A}^{\top}\boldsymbol{A}\cdots\mathop{\mathrm{QR}}(\boldsymbol{A}^{\top}\boldsymbol{A}))) Finally, applying QR at both ends does not change the result. However, since each QR produces an orthogonal matrix, right-multiplying by an orthogonal matrix has almost no risk of numerical explosion, making QR a good preconditioner for itself.

Finite Error

We have repeatedly used words like “theoretically” because many results can only be derived based on infinite precision, exact QR. In actual computation, QR is of finite precision, and for efficiency we often use the less accurate “SCQR (Shifted Cholesky QR)”. Therefore, analyzing the cumulative effect of errors is very necessary.

Regarding SCQR, everyone should now be familiar with it after reading these articles. It divides the QR decomposition of matrix \boldsymbol{A} into two steps: 1) Perform Cholesky decomposition on \boldsymbol{A}^{\top}\boldsymbol{A}+\lambda \boldsymbol{I} to obtain \boldsymbol{R}; 2) Obtain the orthogonal matrix \boldsymbol{Q} via \boldsymbol{Q}=\boldsymbol{A}\boldsymbol{R}^{-1}. If \lambda=0, then SCQR is equivalent to exact QR; however, \lambda=0 almost always fails in actual computation due to a large condition number, so setting an appropriate \lambda > 0 introduces some error.

Fortunately, this error does not accumulate! Whether it is \boldsymbol{A}^{\top}\boldsymbol{A} or \boldsymbol{A}^{\top}\boldsymbol{A}+\lambda \boldsymbol{I}, the result of their Cholesky decomposition is an upper triangular matrix, and then \boldsymbol{A} right-multiplies by the inverse of this upper triangular matrix. We know that right-multiplying by an upper triangular matrix does not change the QR result. Therefore, if we disregard the errors from matrix multiplication, SCQR can be considered “lossless” for QR itself.

In other words, as long as the final step is replaced with exact QR, no matter how many inaccurate SCQR steps are performed earlier, the result will still be exact. Conversely, the total error is equivalent to the error of a single SCQR step (the final one) and does not accumulate. The “SCQR2” acceleration technique we mentioned in “Muon Implementation Based on Streaming Power Iteration: 2. Acceleration” also relies on this principle.

Conversely, if some transformations we propose cannot be written in the form of “right-multiplication by an upper triangular matrix”, they will damage the information of the original matrix, and errors will accumulate over long-term iterations until complete failure. The simplification scheme proposed by @Ji_Ha_Kim that we mentioned in the previous article “Muon Implementation Based on Streaming Power Iteration: 3. Refinement” is a classic counterexample.

Cholesky Decomposition

Finally, let us briefly review the Cholesky decomposition. Its purpose is to represent a given positive definite symmetric matrix \boldsymbol{B} in the form \boldsymbol{L}\boldsymbol{L}^{\top}, where \boldsymbol{L} is a lower triangular matrix. Then by setting \boldsymbol{R}=\boldsymbol{L}^{\top}, we can obtain the upper triangular form \boldsymbol{R}^{\top}\boldsymbol{R}.

Cholesky decomposition is fast because it has an analytical solution that can be computed recursively. Specifically, from the following equation \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,m} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,m} \end{bmatrix}=\begin{bmatrix} l_{1,1} & & & \\ l_{2,1} & l_{2,2} & & \\ \vdots & \vdots & \ddots & \\ l_{m,1} & l_{m,2} & \cdots & l_{m,m} \end{bmatrix}\begin{bmatrix} l_{1,1} & l_{2,1} & \cdots & l_{m,1} \\ & l_{2,2} & \cdots & l_{m,2} \\ & & \ddots & \vdots \\ & & & l_{m,m} \end{bmatrix} equating element by element, we can solve for the recursive formula l_{j,j}=\pm\sqrt {a_{j,j}-\sum_{k=1}^{j-1}l_{j,k}^2},\qquad l_{i,j}=\frac{1}{l_{j,j}}\left(a_{i,j}-\sum_{k=1}^{j-1}l_{i,k}l_{j,k}\right)\quad \text{for } i > j

Cholesky QR is one of the classic applications of Cholesky decomposition. It is similar to Gram-Schmidt orthogonalization but allows us to skip the orthogonalization process and directly obtain the triangular matrix \boldsymbol{R}. They also face the same numerical difficulties: Gram-Schmidt orthogonalization orthonormalizes vectors one by one; if it encounters linearly dependent vectors or zero vectors, the orthogonalization cannot proceed. This is reflected in low effective rank or large condition number in terms of singular values, and in such cases Cholesky decomposition is also prone to failure.

Another classic application of Cholesky decomposition is the inversion of positive definite symmetric matrices. For example, to solve the equation \boldsymbol{B}\boldsymbol{X}=\boldsymbol{C}, where \boldsymbol{B} is positive definite symmetric, we can first decompose it into \boldsymbol{L}\boldsymbol{L}^{\top}, then transform it into \boldsymbol{L}(\boldsymbol{L}^{\top}\boldsymbol{X})=\boldsymbol{C}, which only requires solving two triangular equations, with each step being quite efficient. Among them, the fractional iteration method for \mathop{\mathrm{msign}} proposed by @Ji_Ha_Kim is based on this idea for finding the inverse matrix.

Summary

This article mainly supplemented some mathematical derivations of streaming power iteration, including the convergence of power iteration, preconditioners for QR decomposition, a brief introduction to Cholesky decomposition, etc., hoping to help everyone better understand this method from the principle level.

Reprint must include the address of this article: https://kexue.fm/archives/11710

For more detailed reprint matters, please refer to: Science Space FAQ