In the article "Derivatives of msign", we formally introduced two matrix sign functions, \mathop{\text{msign}} and \mathop{\text{mcsgn}}. The former is the core operation of Muon, while the latter is used to solve the Sylvester equation. So, besides solving the Sylvester equation, what else can \mathop{\text{mcsgn}} do? This article collects some answers to that question.
Two Types of Signs
Let \boldsymbol{M} \in \mathbb{R}^{n \times m}. We have two matrix sign functions: \begin{gather} \mathop{\text{msign}}(\boldsymbol{M}) = (\boldsymbol{M}\boldsymbol{M}^{\top})^{-1/2}\boldsymbol{M} = \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^{-1/2} \\[6pt] \mathop{\text{mcsgn}}(\boldsymbol{M}) = (\boldsymbol{M}^2)^{-1/2}\boldsymbol{M} = \boldsymbol{M}(\boldsymbol{M}^2)^{-1/2} \end{gather} The first applies to matrices of any shape, while the second applies only to square matrices. The exponent ^{-1/2} denotes the inverse of the matrix square root; when the square root is not invertible, the pseudo-inverse is used instead. In general, \mathop{\text{msign}} and \mathop{\text{mcsgn}} yield different results, but they coincide when \boldsymbol{M} is symmetric.
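To make the definitions concrete, here is a minimal numpy sketch (our own illustration, not from the referenced article): msign via the SVD and mcsgn via an eigendecomposition. It assumes the input to mcsgn is diagonalizable with no purely imaginary eigenvalues; the function names and the tolerance are our choices.

```python
import numpy as np

def msign(M, tol=1e-10):
    # msign(M) = U V^T over the non-zero singular values; zero singular
    # values stay zero, matching the pseudo-inverse convention above
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = s > tol * s[0]
    return U[:, keep] @ Vt[keep]

def mcsgn(M):
    # mcsgn(M) = P csgn(Lambda) P^{-1}, where csgn takes the sign of the
    # real part of each eigenvalue; assumes M is diagonalizable with no
    # purely imaginary eigenvalues
    lam, P = np.linalg.eig(M)
    return (P @ np.diag(np.sign(lam.real)) @ np.linalg.inv(P)).real

# the two coincide on symmetric matrices
A = np.random.randn(5, 5)
A = A + A.T
assert np.allclose(msign(A), mcsgn(A), atol=1e-8)
```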
The difference between them is as follows: if \boldsymbol{M} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}, where \boldsymbol{U}, \boldsymbol{V} are orthogonal matrices, then \mathop{\text{msign}}(\boldsymbol{M}) = \boldsymbol{U}\mathop{\text{msign}}(\boldsymbol{\Sigma})\boldsymbol{V}^{\top}. If \boldsymbol{M} = \boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}, where \boldsymbol{P} is an invertible matrix, then \mathop{\text{mcsgn}}(\boldsymbol{M}) = \boldsymbol{P}\mathop{\text{mcsgn}}(\boldsymbol{\Lambda})\boldsymbol{P}^{-1}. Simply put, one possesses orthogonal invariance, while the other possesses similarity invariance; one transforms all non-zero singular values to 1, while the other transforms all non-zero eigenvalues to \pm 1.
For the calculation of \mathop{\text{msign}}, one can refer to "Newton-Schulz Iteration for the msign Operator (Part 1)" and "Newton-Schulz Iteration for the msign Operator (Part 2)", which develop GPU-friendly iterations. As for \mathop{\text{mcsgn}}, since the eigenvalues can be complex, the general case is rather complicated. However, when the eigenvalues of \boldsymbol{M} are all real (which covers almost all scenarios where \mathop{\text{mcsgn}} is used), the iteration for \mathop{\text{msign}} can be reused: \begin{equation} \boldsymbol{X}_0 = \frac{\boldsymbol{M}}{\sqrt{\mathop{\text{tr}}(\boldsymbol{M}^2)}}, \qquad \boldsymbol{X}_{t+1} = a_{t+1}\boldsymbol{X}_t + b_{t+1}\boldsymbol{X}_t^3 + c_{t+1}\boldsymbol{X}_t^5 \end{equation} The normalization works because, for real eigenvalues, \mathop{\text{tr}}(\boldsymbol{M}^2) = \sum_i \lambda_i^2 \geq \max_i \lambda_i^2, so the eigenvalues of \boldsymbol{X}_0 lie in [-1, 1].
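As a sketch of this iteration, the snippet below uses the classical cubic coefficients (a, b, c) = (3/2, -1/2, 0), a conservative special case of the quintic form above that converges, albeit slowly for eigenvalues near zero; the per-step schedules derived in the posts cited above are faster. The helper name and step count are our choices.

```python
import numpy as np

def mcsgn_ns(M, steps=20, a=1.5, b=-0.5, c=0.0):
    # Newton-Schulz iteration for mcsgn, valid when spec(M) is real;
    # defaults are the classical cubic coefficients, not an optimized schedule
    X = M / np.sqrt(np.trace(M @ M))  # eigenvalues of X now lie in [-1, 1]
    for _ in range(steps):
        X2 = X @ X
        X = a * X + (b * X2 + c * X2 @ X2) @ X
    return X

# demo: a diagonalizable matrix with a real, well-separated spectrum
rng = np.random.default_rng(0)
P = rng.standard_normal((6, 6))
lam = rng.uniform(0.3, 1.0, 6) * rng.choice([-1.0, 1.0], 6)
M = P @ np.diag(lam) @ np.linalg.inv(P)
ref = P @ np.diag(np.sign(lam)) @ np.linalg.inv(P)
assert np.allclose(mcsgn_ns(M), ref, atol=1e-6)
```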
We will not expand on further properties here; next, we turn to the applications of \mathop{\text{mcsgn}}.
Block Identities
Historically, \mathop{\text{mcsgn}} was introduced to solve equations: not just the Sylvester equation, but also the more general Algebraic Riccati Equation. The original paper is "Solving the algebraic Riccati equation with the matrix sign function".
Consider the block matrix \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix}, whose inverse is \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix}^{-1} = \begin{bmatrix}\boldsymbol{0} & \boldsymbol{I} \\ -\boldsymbol{I} & \boldsymbol{X}\end{bmatrix}. We can verify that: \begin{equation} \begin{bmatrix}\boldsymbol{0} & \boldsymbol{I} \\ -\boldsymbol{I} & \boldsymbol{X}\end{bmatrix} \begin{bmatrix}\boldsymbol{A} & \boldsymbol{C} \\ \boldsymbol{D} & \boldsymbol{B}\end{bmatrix} \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix} = \begin{bmatrix}\boldsymbol{B} + \boldsymbol{D}\boldsymbol{X} & -\boldsymbol{D} \\ \boldsymbol{X}\boldsymbol{D}\boldsymbol{X} + \boldsymbol{X}\boldsymbol{B} - \boldsymbol{A}\boldsymbol{X} - \boldsymbol{C} & \boldsymbol{A} - \boldsymbol{X}\boldsymbol{D}\end{bmatrix} \end{equation} If \begin{equation} \boldsymbol{X}\boldsymbol{D}\boldsymbol{X} + \boldsymbol{X}\boldsymbol{B} - \boldsymbol{A}\boldsymbol{X} - \boldsymbol{C} = \boldsymbol{0} \label{eq:riccati} \end{equation} then \begin{equation} \begin{bmatrix}\boldsymbol{A} & \boldsymbol{C} \\ \boldsymbol{D} & \boldsymbol{B}\end{bmatrix} = \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix} \begin{bmatrix}\boldsymbol{B} + \boldsymbol{D}\boldsymbol{X} & -\boldsymbol{D} \\ \boldsymbol{0} & \boldsymbol{A} - \boldsymbol{X}\boldsymbol{D}\end{bmatrix} \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix}^{-1} \end{equation} Equation \eqref{eq:riccati} is the Algebraic Riccati Equation. Taking \mathop{\text{mcsgn}} on both sides, we obtain the identity: \begin{equation} \begin{aligned} \mathop{\text{mcsgn}}\left(\begin{bmatrix}\boldsymbol{A} & \boldsymbol{C} \\ \boldsymbol{D} & \boldsymbol{B}\end{bmatrix}\right) &= \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix} \mathop{\text{mcsgn}}\left(\begin{bmatrix}\boldsymbol{B} + \boldsymbol{D}\boldsymbol{X} & -\boldsymbol{D} \\ \boldsymbol{0} & \boldsymbol{A} - \boldsymbol{X}\boldsymbol{D}\end{bmatrix}\right) \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix}^{-1} \\[6pt] &= \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix} \begin{bmatrix}\mathop{\text{mcsgn}}(\boldsymbol{B} + \boldsymbol{D}\boldsymbol{X}) & \boldsymbol{Y} \\ \boldsymbol{0} & \mathop{\text{mcsgn}}(\boldsymbol{A} - \boldsymbol{X}\boldsymbol{D})\end{bmatrix} \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix}^{-1} \end{aligned} \end{equation} The second equality uses the properties of (block) triangular matrices: the eigenvalues of a triangular matrix are its diagonal elements, so the \mathop{\text{mcsgn}} of a triangular matrix is again triangular, with diagonal elements equal to the \text{csgn} of the original diagonal elements. The same holds blockwise for block triangular matrices, which yields the form in the second equality, where \boldsymbol{Y} is a matrix to be determined.
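A quick numerical sanity check of this identity, under the assumption that a random instance is diagonalizable and reasonably conditioned: pick \boldsymbol{X} first, define \boldsymbol{C} so that \boldsymbol{X} solves the Riccati equation exactly, and compare both sides using the eigendecomposition-based mcsgn helper from above.

```python
import numpy as np

def mcsgn(M):
    lam, P = np.linalg.eig(M)
    return (P @ np.diag(np.sign(lam.real)) @ np.linalg.inv(P)).real

n = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
D = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))
C = X @ D @ X + X @ B - A @ X  # chosen so that X solves the Riccati equation

I, O = np.eye(n), np.zeros((n, n))
M = np.block([[A, C], [D, B]])
T = np.block([[X, -I], [I, O]])
R = np.block([[B + D @ X, -D], [O, A - X @ D]])

# mcsgn commutes with similarity: mcsgn(T R T^{-1}) = T mcsgn(R) T^{-1}
assert np.allclose(mcsgn(M), T @ mcsgn(R) @ np.linalg.inv(T), atol=1e-6)
```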
Several Results
Below, we further simplify based on specific situations to obtain some results that might be useful.
First Example
Assume \boldsymbol{D} = \boldsymbol{0}, \boldsymbol{B} is positive definite, and \boldsymbol{A} is negative definite. With \boldsymbol{D} = \boldsymbol{0}, the Riccati equation \eqref{eq:riccati} reduces to the Sylvester equation \boldsymbol{X}\boldsymbol{B} - \boldsymbol{A}\boldsymbol{X} = \boldsymbol{C}, and the middle factor becomes block diagonal. Since \mathop{\text{mcsgn}} maps a block-diagonal matrix to a block-diagonal matrix, \boldsymbol{Y} = \boldsymbol{0}, with \mathop{\text{mcsgn}}(\boldsymbol{B}) = \boldsymbol{I} and \mathop{\text{mcsgn}}(\boldsymbol{A}) = -\boldsymbol{I}. Then: \begin{equation} \mathop{\text{mcsgn}}\left(\begin{bmatrix}\boldsymbol{A} & \boldsymbol{C} \\ \boldsymbol{0} & \boldsymbol{B}\end{bmatrix}\right) = \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix} \begin{bmatrix}\boldsymbol{I} & \boldsymbol{0} \\ \boldsymbol{0} & -\boldsymbol{I}\end{bmatrix} \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix}^{-1} = \begin{bmatrix}-\boldsymbol{I} & 2\boldsymbol{X} \\ \boldsymbol{0} & \boldsymbol{I}\end{bmatrix} \end{equation} This means that the solution of the Sylvester equation \boldsymbol{X}\boldsymbol{B} - \boldsymbol{A}\boldsymbol{X} = \boldsymbol{C} can be read directly from \mathop{\text{mcsgn}}\left(\begin{bmatrix}\boldsymbol{A} & \boldsymbol{C} \\ \boldsymbol{0} & \boldsymbol{B}\end{bmatrix}\right): it is half of the upper-right block.
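A hedged sketch of this recipe: build the block matrix from a positive definite \boldsymbol{B} and negative definite \boldsymbol{A}, take mcsgn, and read off \boldsymbol{X} from the upper-right block, cross-checking against scipy.linalg.solve_sylvester (which solves AX + XB = Q, hence the sign flip below). The test matrices are arbitrary constructions.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def mcsgn(M):
    lam, P = np.linalg.eig(M)
    return (P @ np.diag(np.sign(lam.real)) @ np.linalg.inv(P)).real

n = 4
rng = np.random.default_rng(1)
G, H = rng.standard_normal((n, n)), rng.standard_normal((n, n))
B = G @ G.T + n * np.eye(n)      # positive definite
A = -(H @ H.T + n * np.eye(n))   # negative definite
C = rng.standard_normal((n, n))

S = mcsgn(np.block([[A, C], [np.zeros((n, n)), B]]))
X = S[:n, n:] / 2                # half of the upper-right block

# scipy's solve_sylvester solves A X + X B = Q, so pass -A for X B - A X = C
assert np.allclose(X, solve_sylvester(-A, B, C), atol=1e-6)
```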
Second Example
Assume \boldsymbol{A} = \boldsymbol{B} = \boldsymbol{0}, \boldsymbol{D} = \boldsymbol{I}, and \boldsymbol{C} is positive definite. Then the Riccati equation simplifies to \boldsymbol{X}^2 = \boldsymbol{C}, i.e., \boldsymbol{X} = \boldsymbol{C}^{1/2}. Since \boldsymbol{C}^{1/2} is positive definite, \mathop{\text{mcsgn}}(\boldsymbol{B} + \boldsymbol{D}\boldsymbol{X}) = \mathop{\text{mcsgn}}(\boldsymbol{C}^{1/2}) = \boldsymbol{I} and \mathop{\text{mcsgn}}(\boldsymbol{A} - \boldsymbol{X}\boldsymbol{D}) = \mathop{\text{mcsgn}}(-\boldsymbol{C}^{1/2}) = -\boldsymbol{I}, so: \begin{equation} \mathop{\text{mcsgn}}\left(\begin{bmatrix}\boldsymbol{0} & \boldsymbol{C} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix}\right) = \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix} \begin{bmatrix}\boldsymbol{I} & \boldsymbol{Y} \\ \boldsymbol{0} & -\boldsymbol{I}\end{bmatrix} \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix}^{-1} = \begin{bmatrix}-\boldsymbol{X}\boldsymbol{Y}-\boldsymbol{I} & 2\boldsymbol{X} + \boldsymbol{X}\boldsymbol{Y}\boldsymbol{X} \\ -\boldsymbol{Y} & \boldsymbol{Y}\boldsymbol{X} + \boldsymbol{I}\end{bmatrix} \end{equation} Note that \mathop{\text{mcsgn}} is an odd function, and an odd function of a (block) anti-diagonal matrix is itself (block) anti-diagonal; therefore \boldsymbol{Y}\boldsymbol{X} + \boldsymbol{I} = \boldsymbol{0}. Solving gives \boldsymbol{Y} = -\boldsymbol{X}^{-1} = -\boldsymbol{C}^{-1/2}. Substituting back into the equation above yields: \begin{equation} \mathop{\text{mcsgn}}\left(\begin{bmatrix}\boldsymbol{0} & \boldsymbol{C} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix}\right) = \begin{bmatrix}\boldsymbol{0} & \boldsymbol{C}^{1/2} \\ \boldsymbol{C}^{-1/2} & \boldsymbol{0}\end{bmatrix} \end{equation} This shows that \mathop{\text{mcsgn}} can also be used to compute the square root and inverse square root of a matrix.
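A short numerical illustration of this, checked against scipy.linalg.sqrtm; the construction of the positive definite test matrix is arbitrary, and the eigendecomposition-based mcsgn helper is our own.

```python
import numpy as np
from scipy.linalg import sqrtm

def mcsgn(M):
    lam, P = np.linalg.eig(M)
    return (P @ np.diag(np.sign(lam.real)) @ np.linalg.inv(P)).real

n = 4
rng = np.random.default_rng(2)
G = rng.standard_normal((n, n))
C = G @ G.T + n * np.eye(n)      # positive definite test matrix

I, O = np.eye(n), np.zeros((n, n))
S = mcsgn(np.block([[O, C], [I, O]]))
C_half, C_inv_half = S[:n, n:], S[n:, :n]

assert np.allclose(C_half, sqrtm(C), atol=1e-6)
assert np.allclose(C_half @ C_inv_half, I, atol=1e-6)
```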
Third Example
Assume \boldsymbol{A} = \boldsymbol{B} = \boldsymbol{0} and \boldsymbol{D} = \boldsymbol{C}^{\top}. Then the Riccati equation simplifies to \boldsymbol{X}\boldsymbol{C}^{\top}\boldsymbol{X} = \boldsymbol{C}, and it is easy to verify that \boldsymbol{X} = \mathop{\text{msign}}(\boldsymbol{C}) is a solution. We treat only the ideal case where \boldsymbol{C} is a full-rank square matrix; writing \boldsymbol{C} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}, the diagonal blocks \boldsymbol{C}^{\top}\boldsymbol{X} = \boldsymbol{V}\boldsymbol{\Sigma}\boldsymbol{V}^{\top} and \boldsymbol{X}\boldsymbol{C}^{\top} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{U}^{\top} are both positive definite, so we have: \begin{equation} \mathop{\text{mcsgn}}\left(\begin{bmatrix}\boldsymbol{0} & \boldsymbol{C} \\ \boldsymbol{C}^{\top} & \boldsymbol{0}\end{bmatrix}\right) = \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix} \begin{bmatrix}\boldsymbol{I} & \boldsymbol{Y} \\ \boldsymbol{0} & -\boldsymbol{I}\end{bmatrix} \begin{bmatrix}\boldsymbol{X} & -\boldsymbol{I} \\ \boldsymbol{I} & \boldsymbol{0}\end{bmatrix}^{-1} = \begin{bmatrix}-\boldsymbol{X}\boldsymbol{Y}-\boldsymbol{I} & 2\boldsymbol{X} + \boldsymbol{X}\boldsymbol{Y}\boldsymbol{X} \\ -\boldsymbol{Y} & \boldsymbol{Y}\boldsymbol{X} + \boldsymbol{I}\end{bmatrix} \end{equation} By the same logic as the previous section, \boldsymbol{Y}\boldsymbol{X} + \boldsymbol{I} = \boldsymbol{0}, so: \begin{equation} \mathop{\text{mcsgn}}\left(\begin{bmatrix}\boldsymbol{0} & \boldsymbol{C} \\ \boldsymbol{C}^{\top} & \boldsymbol{0}\end{bmatrix}\right) = \begin{bmatrix} \boldsymbol{0} & \mathop{\text{msign}}(\boldsymbol{C}) \\ \mathop{\text{msign}}(\boldsymbol{C}^{\top}) & \boldsymbol{0} \end{bmatrix} \end{equation} That is, \mathop{\text{mcsgn}} can also be used to compute \mathop{\text{msign}}. In fact, the equality can be proven directly for any matrix \boldsymbol{C}, but deriving it from the Riccati-equation perspective involves some tedious details, which we leave to the reader.
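The same kind of check for this identity, comparing the blocks against the SVD-based \mathop{\text{msign}}(\boldsymbol{C}) = \boldsymbol{U}\boldsymbol{V}^{\top} for a random square \boldsymbol{C} (full-rank almost surely); again the mcsgn helper is our own eig-based sketch.

```python
import numpy as np

def mcsgn(M):
    lam, P = np.linalg.eig(M)
    return (P @ np.diag(np.sign(lam.real)) @ np.linalg.inv(P)).real

n = 5
rng = np.random.default_rng(3)
C = rng.standard_normal((n, n))  # full-rank square almost surely
O = np.zeros((n, n))

S = mcsgn(np.block([[O, C], [C.T, O]]))
U, _, Vt = np.linalg.svd(C)      # msign(C) = U V^T for full-rank C

assert np.allclose(S[:n, n:], U @ Vt, atol=1e-6)
assert np.allclose(S[n:, :n], (U @ Vt).T, atol=1e-6)
```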
Fourth Example
The second and third examples generalize to a broader conclusion: \begin{equation} \mathop{\text{mcsgn}}\left(\begin{bmatrix}\boldsymbol{0} & \boldsymbol{C} \\ \boldsymbol{D} & \boldsymbol{0}\end{bmatrix}\right) = \begin{bmatrix}\boldsymbol{0} & \boldsymbol{C}(\boldsymbol{D}\boldsymbol{C})^{-1/2} \\ \boldsymbol{D}(\boldsymbol{C}\boldsymbol{D})^{-1/2} & \boldsymbol{0}\end{bmatrix} \end{equation} This holds for any \boldsymbol{C}, \boldsymbol{D} of compatible shapes (with pseudo-inverses where needed). Readers are invited to complete the proof themselves.
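A sketch verifying this in the full-rank square case, where both square roots are ordinary inverses; we construct \boldsymbol{D} so that \boldsymbol{D}\boldsymbol{C} is positive definite, keeping the principal square roots well-defined. The fully general rectangular case needs pseudo-inverses and is not covered here.

```python
import numpy as np
from scipy.linalg import sqrtm

def mcsgn(M):
    lam, P = np.linalg.eig(M)
    return (P @ np.diag(np.sign(lam.real)) @ np.linalg.inv(P)).real

n = 4
rng = np.random.default_rng(4)
G = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))
D = (G @ G.T + n * np.eye(n)) @ np.linalg.inv(C)  # so D C is positive definite

O = np.zeros((n, n))
S = mcsgn(np.block([[O, C], [D, O]]))

assert np.allclose(S[:n, n:], C @ np.linalg.inv(sqrtm(D @ C)), atol=1e-6)
assert np.allclose(S[n:, :n], D @ np.linalg.inv(sqrtm(C @ D)), atol=1e-6)
```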
Summary
This article primarily organizes several identities related to \mathop{\text{mcsgn}} from the perspective of solving the Algebraic Riccati Equation.
Original Address: https://kexue.fm/archives/11056