English (unofficial) translations of posts at kexue.fm

The Road to Transformer Upgrade: 6. Completeness Analysis of Rotary Positional Embedding (RoPE)

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

In last year’s article "The Road to Transformer Upgrade: 2. Rotary Positional Embedding (RoPE) which Incorporates the Strengths of Many", I proposed Rotary Positional Embedding (RoPE). At the time, the starting point was simply that implementing relative positions through absolute positions seemed like a "fun thing to do"; I did not expect it to perform so well in practice and become so widely adopted, which was a genuinely unexpected surprise. Later, in "The Road to Transformer Upgrade: 4. Rotary Positional Embedding for 2D Positions", I discussed the 2D form of RoPE and studied the general solution of RoPE expressed via matrix exponentials.

Since we have a general solution, a natural question arises: the RoPE we commonly use is just a block-diagonal matrix with 2D rotation matrices as basic units. If we replaced it with the general solution, would the performance theoretically be better? This article aims to answer that question.

General Exponential Solution

In "The Road to Transformer Upgrade: 4. Rotary Positional Embedding for 2D Positions", we abstractly defined RoPE as any square matrix satisfying the following equation: \begin{equation} \boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n=\boldsymbol{\mathcal{R}}_{n-m}\label{eq:re} \end{equation} Then, we explored solutions in the form of the following matrix exponential: \begin{equation} \boldsymbol{\mathcal{R}}_n=\exp n\boldsymbol{B} \end{equation} The matrix exponential here is not an element-wise operation like the Softmax activation function, but rather the "Matrix Exponential" defined by the Taylor series. According to the "Baker–Campbell–Hausdorff formula", we have: \begin{equation} \begin{aligned} \boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n=&\,(\exp m\boldsymbol{B})^{\top}(\exp n\boldsymbol{B}) = (\exp m\boldsymbol{B}^{\top})(\exp n\boldsymbol{B}) \\ =&\,\exp\left(m\boldsymbol{B}^{\top} + n\boldsymbol{B} + \frac{1}{2}mn[\boldsymbol{B}^{\top}, \boldsymbol{B}]+\cdots\right) \end{aligned} \end{equation} Here [\boldsymbol{A}, \boldsymbol{B}]=\boldsymbol{A}\boldsymbol{B}-\boldsymbol{B}\boldsymbol{A}, and \cdots denotes terms of third order or higher in m, n. According to Equation [eq:re], the exponential part should be equal to (n-m)\boldsymbol{B}, which implies: \begin{equation} \boldsymbol{B}^{\top} = - \boldsymbol{B} \end{equation} That is, \boldsymbol{B} is required to be a skew-symmetric matrix.
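As a quick numerical check of this conclusion (a sketch added in translation, not part of the original post), we can verify with NumPy that \exp n\boldsymbol{B} for a random skew-symmetric \boldsymbol{B} indeed satisfies Equation [eq:re]. The truncated-Taylor `expm` below is a small illustrative helper, adequate only for small, well-scaled matrices:

```python
import numpy as np

def expm(A, terms=60):
    """Matrix exponential via its truncated Taylor series
    (adequate only for the small, well-scaled matrices used here)."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = 0.1 * (A - A.T)            # skew-symmetric: B^T = -B
R = lambda n: expm(n * B)      # candidate RoPE matrix

m, n = 2, 5
# R_m^T R_n = R_{n-m}: the defining relative-position property
print(np.allclose(R(m).T @ R(n), R(n - m)))  # True
```

The same check with a non-skew-symmetric \boldsymbol{B} fails, consistent with the derivation above.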

General Orthogonal Solution

Furthermore, we have (\exp \boldsymbol{B})^{\top}(\exp \boldsymbol{B})=\exp(\boldsymbol{B}-\boldsymbol{B}) = \boldsymbol{I} and \exp n\boldsymbol{B} = (\exp\boldsymbol{B})^n. The former shows that \exp\boldsymbol{B} is an orthogonal matrix, while the latter suggests asking whether the solution can be generalized to an arbitrary orthogonal matrix. It is not difficult to verify that the answer is affirmative; we have the following conclusion:

For any orthogonal matrix \boldsymbol{O}, \boldsymbol{\mathcal{R}}_n=\boldsymbol{O}^n is a solution that satisfies Equation [eq:re].
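This conclusion is easy to check numerically. The sketch below (an illustration added in translation, using a random orthogonal matrix obtained from a QR decomposition) verifies it:

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a random orthogonal matrix via QR decomposition
O, _ = np.linalg.qr(rng.standard_normal((6, 6)))

Rn = lambda n: np.linalg.matrix_power(O, n)

m, n = 3, 7
# (O^m)^T O^n = O^{-m} O^n = O^{n-m}, using O^T = O^{-1}
print(np.allclose(Rn(m).T @ Rn(n), Rn(n - m)))  # True
```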

It is worth noting that in the real field, not all orthogonal matrices can be written in the form \exp\boldsymbol{B}, so \boldsymbol{O}^n is actually broader than the matrix exponential form. From "Appreciation of the Identity \det(\exp(\boldsymbol{A})) = \exp(\text{Tr}(\boldsymbol{A}))", we know that \det(\exp(\boldsymbol{A})) = \exp(\text{Tr}(\boldsymbol{A})) > 0, so the determinant of an orthogonal matrix that can be written in matrix exponential form must be greater than 0 (i.e., equal to 1). In fact, the converse is also true: any orthogonal matrix with a determinant equal to 1 can necessarily be written in the form \exp\boldsymbol{B}, where \boldsymbol{B} is a skew-symmetric matrix (refer to "Why can any orthogonal matrix be written as O=e^A").

For an orthogonal matrix with \det(\boldsymbol{O}) = -1, we can write \boldsymbol{O}=\boldsymbol{O}_+ \boldsymbol{I}_-, where \boldsymbol{I}_- is a diagonal matrix whose diagonal entries are all 1 except for a single -1, and \boldsymbol{O}_+ is an orthogonal matrix with \det(\boldsymbol{O}_+) = 1, which can therefore be written in the form \exp\boldsymbol{B}. In this case, \boldsymbol{O}^n = (\boldsymbol{O}_+ \boldsymbol{I}_-)^n = \boldsymbol{I}_-^n\exp n\boldsymbol{B}. In other words, even when \det(\boldsymbol{O}) = -1, \boldsymbol{O}^n is just a simple modification of \exp n\boldsymbol{B}, so we will primarily focus on solutions of the form \exp n\boldsymbol{B}.

Completeness Analysis

As is well known, the RoPE positional encoding we commonly use is a block-diagonal matrix of the following form: \begin{equation} \scriptsize{\left(\begin{array}{cc:cc:cc:cc} \cos n\theta_0 & -\sin n\theta_0 & 0 & 0 & \cdots & \cdots & 0 & 0 \\ \sin n\theta_0 & \cos n\theta_0 & 0 & 0 & \cdots & \cdots & 0 & 0 \\ \hdashline 0 & 0 & \cos n\theta_1 & -\sin n\theta_1 & \cdots & \cdots & 0 & 0 \\ 0 & 0 & \sin n\theta_1 & \cos n\theta_1 & \cdots & \cdots & 0 & 0 \\ \hdashline \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ \hdashline 0 & 0 & 0 & 0 & \cdots & \cdots & \cos n\theta_{d/2-1} & -\sin n\theta_{d/2-1} \\ 0 & 0 & 0 & 0 & \cdots & \cdots & \sin n\theta_{d/2-1} & \cos n\theta_{d/2-1} \\ \end{array}\right)} \end{equation} It can be abbreviated as: \begin{equation} \begin{pmatrix} \boldsymbol{R}_{n\theta_0} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{R}_{n\theta_1} & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{R}_{n\theta_{d/2-1}} \\ \end{pmatrix} = \exp n\begin{pmatrix} \boldsymbol{J}_{\theta_0} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{J}_{\theta_1} & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{J}_{\theta_{d/2-1}} \\ \end{pmatrix} \end{equation} where \begin{equation} \boldsymbol{R}_{\theta} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},\quad \boldsymbol{J}_{\theta} = \begin{pmatrix} 0 & -\theta \\ \theta & 0\end{pmatrix} \end{equation} This choice can be said to be the simplest one, and its essential reason is to reduce the computational cost. Thus, the so-called completeness problem is to answer: does this special case of a block-diagonal matrix lack any capability compared to the full-parameter \exp n\boldsymbol{B}? 
In other words, if we ignore the computational cost and replace \boldsymbol{B} with a general skew-symmetric matrix, would there be a potential improvement in performance?
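To make the identity above concrete, here is a small NumPy sketch (added for illustration; `block_diag`, `gen`, and the truncated-Taylor `expm` are hypothetical helpers) checking that the block-diagonal rotation matrix equals the matrix exponential of the block-diagonal generator:

```python
import numpy as np

def rot(t):
    """2D rotation matrix R_t."""
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def gen(t):
    """2D skew-symmetric generator J_t, with exp(J_t) = R_t."""
    return np.array([[0.0, -t], [t, 0.0]])

def expm(A, terms=50):
    """Matrix exponential via truncated Taylor series (small matrices only)."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

def block_diag(blocks):
    """Assemble 2x2 blocks into one block-diagonal matrix."""
    d = 2 * len(blocks)
    out = np.zeros((d, d))
    for i, b in enumerate(blocks):
        out[2*i:2*i+2, 2*i:2*i+2] = b
    return out

d, n = 8, 5
thetas = [10000 ** (-2 * i / d) for i in range(d // 2)]  # the usual RoPE frequencies

R_n   = block_diag([rot(n * t) for t in thetas])        # block-diagonal rotations
expnL = expm(n * block_diag([gen(t) for t in thetas]))  # exp of block-diagonal generator
print(np.allclose(R_n, expnL))  # True
```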

Answering this question is not difficult. In fact, any skew-symmetric matrix of even order can be brought, by a similarity transformation, into a block-diagonal matrix of the form: \begin{equation} \boldsymbol{\Lambda} = \begin{pmatrix} \boldsymbol{J}_{\theta_0} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{J}_{\theta_1} & \cdots & \boldsymbol{0} \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{J}_{\theta_{d/2-1}} \\ \end{pmatrix} \end{equation} This conclusion can be found in the "Skew-symmetric matrix" entry. That is to say, there exists an invertible matrix \boldsymbol{P} such that \boldsymbol{B}=\boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}, and thus: \begin{equation} \exp n\boldsymbol{B} = \exp (n\boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}) = \boldsymbol{P}(\exp n\boldsymbol{\Lambda})\boldsymbol{P}^{-1} \end{equation} This means that any \exp n\boldsymbol{B} differs from the block-diagonal \exp n\boldsymbol{\Lambda} only by a similarity transformation. When we apply RoPE in Self-Attention, we have: \begin{equation} \boldsymbol{q}^{\top}(\exp (n-m)\boldsymbol{B})\boldsymbol{k} = (\boldsymbol{P}^{\top}\boldsymbol{q})^{\top}(\exp (n-m)\boldsymbol{\Lambda})(\boldsymbol{P}^{-1}\boldsymbol{k}) \end{equation} Since \boldsymbol{q}, \boldsymbol{k} are generally obtained from the input \boldsymbol{x} through learnable linear transformations, \boldsymbol{P}^{\top} and \boldsymbol{P}^{-1} can in principle be absorbed into those trainable parameters. Therefore, working directly with \boldsymbol{q}^{\top}(\exp (n-m)\boldsymbol{\Lambda})\boldsymbol{k} theoretically loses no generality.
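The absorption argument can also be checked numerically. In the sketch below (an illustration added in translation, with a truncated-Taylor `expm` helper), we build \boldsymbol{B}=\boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1} using an orthogonal \boldsymbol{P} (so that \boldsymbol{P}^{-1}=\boldsymbol{P}^{\top} and \boldsymbol{B} stays skew-symmetric) and confirm the bilinear identity:

```python
import numpy as np

def expm(A, terms=60):
    """Matrix exponential via truncated Taylor series (small matrices only)."""
    out, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

rng = np.random.default_rng(0)
d = 6
# Block-diagonal generator Lambda with 2x2 blocks J_theta
Lam = np.zeros((d, d))
for i, t in enumerate(rng.uniform(0.1, 1.0, d // 2)):
    Lam[2*i, 2*i+1], Lam[2*i+1, 2*i] = -t, t

# Orthogonal P, so B = P Lam P^{-1} is again skew-symmetric
P, _ = np.linalg.qr(rng.standard_normal((d, d)))
B = P @ Lam @ P.T                       # P^{-1} = P^T for orthogonal P

q, k = rng.standard_normal(d), rng.standard_normal(d)
nm = 4                                  # relative position n - m
lhs = q @ expm(nm * B) @ k
rhs = (P.T @ q) @ expm(nm * Lam) @ (np.linalg.inv(P) @ k)
print(np.allclose(lhs, rhs))  # True
```

Here \boldsymbol{P}^{\top}\boldsymbol{q} and \boldsymbol{P}^{-1}\boldsymbol{k} are exactly the vectors that a retrained linear projection could produce directly.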

So, for Self-Attention, the answer to the question is negative. For Linear Attention, however, the answer is slightly different, because there \boldsymbol{q}, \boldsymbol{k} pass through activation functions: \begin{equation} \phi(\boldsymbol{q})^{\top}(\exp (n-m)\boldsymbol{B})\varphi(\boldsymbol{k}) = (\boldsymbol{P}^{\top}\phi(\boldsymbol{q}))^{\top}(\exp (n-m)\boldsymbol{\Lambda})(\boldsymbol{P}^{-1}\varphi(\boldsymbol{k})) \end{equation} As a result, \boldsymbol{P}^{\top} and \boldsymbol{P}^{-1} cannot in general be absorbed into the trainable parameters of the linear transformations. Therefore, adding these two parameter matrices to Linear Attention could potentially bring an improvement.
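The obstruction is simply that an elementwise activation does not commute with a general linear map, so \boldsymbol{P}^{\top} cannot be pushed through \phi into the projection weights. A minimal illustration (added in translation, using the common \text{elu}(x)+1 feature map as an assumed choice of \phi):

```python
import numpy as np

rng = np.random.default_rng(0)
# elu(x) + 1, a feature map often used in Linear Attention
phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))

P, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # a generic orthogonal P
q = rng.standard_normal(4)

# In Self-Attention, P^T could be folded into the q-projection weights.
# With an elementwise activation in between, the two orders differ:
print(np.allclose(P.T @ phi(q), phi(P.T @ q)))  # False
```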

Summary

This article briefly analyzed the completeness of RoPE, showing that for Self-Attention the commonly used block-diagonal RoPE loses no generality, while for Linear Attention the fully parameterized form could in principle offer additional capacity.

Reprinting: Please include the original address of this article: https://kexue.fm/archives/9403

Further details on reprinting: Please refer to "Scientific Space FAQ".