Readers who have been following the "Transformer Upgrade Road" series up to this point are likely already familiar with Rotary Positional Encoding (RoPE). Simply put, RoPE is a rotary transformation applied to the Query (\boldsymbol{Q}) and Key (\boldsymbol{K}) of the Attention mechanism. Formally, it belongs to the category of absolute positional encoding, but when combined with the dot-product property of Attention, it automatically achieves a relative positional effect.
So, can RoPE be applied to the Value (\boldsymbol{V})? At first glance, it seems not, because rotating \boldsymbol{V} does not result in relative positional encoding. However, things are not quite so absolute. In this article, we will discuss applying RoPE to \boldsymbol{V}, which we can call the "Second Type of Rotary Positional Encoding."
Background Review
We can decompose Dot-Product Attention as: \begin{equation} \boldsymbol{o}_i = \sum_j a_{i,j}\boldsymbol{v}_j,\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum\limits_j e^{s_{i,j}}},\qquad s_{i,j} = \boldsymbol{q}_i^{\top}\boldsymbol{k}_j \end{equation} For simplicity, the scaling factor for s_{i,j} is omitted here. RoPE is applied to \boldsymbol{q}_i and \boldsymbol{k}_j: \begin{equation} \boldsymbol{q}_i \to \boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i,\qquad \boldsymbol{k}_j \to \boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j \end{equation} This causes the Attention Logits, s_{i,j}, to become: \begin{equation} s_{i,j} = (\boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i)^{\top} (\boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j) = \boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j=\boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{k}_j \end{equation} In other words, s_{i,j} only depends on the relative position j-i, thereby achieving a relative positional effect through an absolute positional form. This transformation process utilizes the property of rotation matrices: \boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j=\boldsymbol{\mathcal{R}}_{j-i}.
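As a concrete sanity check, here is a minimal NumPy sketch (an illustration rather than the article's actual implementation; the even head dimension and the base 10000 are the usual RoPE defaults, assumed here) that builds the block-diagonal \boldsymbol{\mathcal{R}}_i and verifies numerically that the rotated dot product depends only on j-i:

```python
import numpy as np

def rope_matrix(pos, d, base=10000.0):
    """Block-diagonal RoPE rotation matrix R_pos for an even head dimension d."""
    thetas = base ** (-np.arange(0, d, 2) / d)          # per-block frequencies
    R = np.zeros((d, d))
    for k, theta in enumerate(thetas):
        c, s = np.cos(pos * theta), np.sin(pos * theta)
        R[2*k:2*k+2, 2*k:2*k+2] = [[c, -s], [s, c]]     # 2x2 rotation block
    return R

d = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# The same relative offset j - i = 4 at two different absolute positions
s1 = (rope_matrix(3, d) @ q) @ (rope_matrix(7, d) @ k)    # i = 3,  j = 7
s2 = (rope_matrix(10, d) @ q) @ (rope_matrix(14, d) @ k)  # i = 10, j = 14
print(np.allclose(s1, s2))  # True: the logit depends only on j - i
```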
In addition to rotation matrices, in "Transformer Upgrade Road: 4, Rotary Positional Encoding for 2D Positions", we proved that the general solution is \boldsymbol{\mathcal{R}}_i = \boldsymbol{O}^i, where \boldsymbol{O} is any orthogonal matrix and the superscript denotes the i-th matrix power. However, we later showed in "Transformer Upgrade Road: 6, Completeness Analysis of Rotary Positional Encoding" that such general orthogonal-matrix solutions are essentially isomorphic to the rotation-matrix solution.
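For intuition, here is a standard two-dimensional illustration (an elementary fact, not a result taken from the cited posts): if \boldsymbol{O} is a 2D rotation by an angle \theta, then the matrix power simply accumulates the angle, \begin{equation} \boldsymbol{O} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \quad\Longrightarrow\quad \boldsymbol{O}^i = \begin{pmatrix} \cos i\theta & -\sin i\theta \\ \sin i\theta & \cos i\theta \end{pmatrix} \end{equation} which is precisely one 2\times 2 block of the usual \boldsymbol{\mathcal{R}}_i.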
New Usage
What if we apply RoPE to \boldsymbol{v}_j, i.e., \boldsymbol{v}_j \to \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j? Clearly, the result of the Attention is: \begin{equation} \boldsymbol{o}_i = \sum_j a_{i,j} \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j \label{eq:v-rope-abs} \end{equation} This makes the Attention output depend explicitly on the absolute position j. If all we wanted were some form of positional encoding, this might not be a problem; but if we specifically want relative positional encoding, it falls short of the goal.
However, there is a simple trick that fixes this flaw! We can apply an inverse RoPE to \boldsymbol{o}_i: \begin{equation} \boldsymbol{o}_i = \boldsymbol{\mathcal{R}}_i^{\top}\left(\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j\right)=\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j=\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{v}_j \label{eq:vo-rope} \end{equation} In this way, it becomes a relative positional encoding once again! Formally, it too is built from two absolute positional transformations, just like the existing RoPE. We therefore call it the "Second Type of Rotary Positional Encoding," or more intuitively "VO-RoPE," since it applies RoPE to both the Value and the Output. Correspondingly, standard RoPE can be called "QK-RoPE."
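Below is a minimal NumPy sketch of this identity (an illustration, not the article's experiments; it reuses the rope_matrix helper from the sketch above and treats the attention weights a_{i,j} as given): rotating each value by \boldsymbol{\mathcal{R}}_j and counter-rotating the output by \boldsymbol{\mathcal{R}}_i^{\top} makes the result depend only on the offsets j-i, whereas V-RoPE alone does not.

```python
import numpy as np  # rope_matrix is assumed to be defined as in the sketch above

def vo_rope_output(i, positions, a, V, d):
    """o_i = R_i^T * sum_j a_j * R_{p_j} v_j, with the attention weights a given."""
    out = sum(a_j * (rope_matrix(p_j, d) @ v_j)
              for a_j, p_j, v_j in zip(a, positions, V))
    return rope_matrix(i, d).T @ out

d, n = 8, 5
rng = np.random.default_rng(1)
V = rng.normal(size=(n, d))
a = rng.dirichlet(np.ones(n))               # some fixed attention distribution

pos = np.arange(n)                          # keys at positions 0..4, query at i = 2
o_a = vo_rope_output(2, pos, a, V, d)
o_b = vo_rope_output(2 + 100, pos + 100, a, V, d)   # shift all positions by 100
print(np.allclose(o_a, o_b))                # True: only the offsets j - i matter

# By contrast, V-RoPE alone (no counter-rotation of the output) is absolute:
v_a = sum(a_j * (rope_matrix(p, d) @ v) for a_j, p, v in zip(a, pos, V))
v_b = sum(a_j * (rope_matrix(p, d) @ v) for a_j, p, v in zip(a, pos + 100, V))
print(np.allclose(v_a, v_b))                # False: absolute positions leak in
```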
Simple Experiment
A quick round of experiments was conducted on a LLaMA-like model with approximately 1B parameters. The configurations compared were:
1. NoPE: No positional encoding at all;
2. QK-RoPE: Standard rotary positional encoding;
3. VO-RoPE: The second type of rotary positional encoding proposed in this article;
4. Q/K/V/O-RoPE: Applying rotary positional encoding individually to one of Q, K, V, or O;
5. QKV-RoPE: Applying rotary positional encoding to Q, K, and V;
6. QKVO-RoPE: Applying rotary positional encoding to Q, K, V, and O.
Note that configurations 4 and 5 are, strictly speaking, absolute positional encodings. The general conclusion (where "A > B" means A attains a lower training loss, i.e., performs better, than B) is: \text{QK-RoPE} \approx \text{QKVO-RoPE} > \text{K-RoPE} \approx \text{VO-RoPE} > \text{QKV-RoPE} > \text{NoPE} > \text{Q/V/O-RoPE}
The specific loss values are:
| Configuration | Loss |
|---|---|
| QK-RoPE | 2.712 |
| QKVO-RoPE | 2.719 |
| K-RoPE | 2.769 |
| VO-RoPE | 2.770 |
| QKV-RoPE | 2.783 |
| NoPE | 2.795 |
| O-RoPE | 2.841 |
| Q-RoPE | 2.851 |
| V-RoPE | 2.856 |
Some Thoughts
From the results above, it can be seen that VO-RoPE is superior to NoPE but inferior to QK-RoPE, and stacking VO-RoPE with QK-RoPE does not provide any gain. In this light, does VO-RoPE seem unnecessary?
In the author's view, completing the picture of how RoPE can be used, answering the question "Can RoPE be added to Value?", and establishing experimentally that "there is no significant gain" is valuable in itself. Moreover, in the long run it may not be entirely without benefit; it may simply fail to show its effect under today's mainstream language-model settings. When the author first proposed RoPE, the motivation was simply for fun, with no expectation that it would become a competitive positional encoding (what happened afterwards was a stroke of luck).
Currently, VO-RoPE has a potential application scenario related to MLA, which was introduced in "The Ultimate Tug-of-War Between Cache and Performance: From MHA, MQA, GQA to MLA". We know that at inference time, MLA is approximately equivalent to an MQA in which K and V are shared: \begin{equation} \boldsymbol{o}_i = \sum_{j=1}^i a_{i,j}\boldsymbol{c}_j,\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum\limits_{j=1}^i e^{s_{i,j}}},\qquad s_{i,j} = \boldsymbol{q}_i^{\top}\boldsymbol{c}_j \end{equation} This property allows its KV Cache to store only the single vector \boldsymbol{c}_j per token. However, this important feature is incompatible with QK-RoPE, because once RoPE is added to \boldsymbol{c}_j inside the Attention matrix, there are two possible outcomes:
1. If \boldsymbol{c}_j on the Value side does not carry RoPE, then K and V are no longer fully shared. This means either doubling the KV Cache (caching both the pre-RoPE and post-RoPE versions of \boldsymbol{c}_j) or applying RoPE to K on the fly (introducing extra latency);
2. If \boldsymbol{c}_j on the Value side does have RoPE, the K and V sharing effect is achieved, but it is no longer a relative positional encoding.
To solve this, MLA adopts a "mostly NoPE + a small RoPE part" concatenation approach. However, from the second type of rotary positional encoding discussed in this article, we know that, on top of option 2 (where \boldsymbol{c}_j already carries RoPE on both the Key and Value sides), we only need to additionally apply an O-RoPE, i.e., an inverse rotation, to the Output: \begin{equation} \boldsymbol{o}_i = \boldsymbol{\mathcal{R}}_i^{\top}\sum_{j=1}^i a_{i,j}(\boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j),\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum\limits_{j=1}^i e^{s_{i,j}}},\qquad s_{i,j} = (\boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i)^{\top} (\boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j) \end{equation} That said, this idea has not been fully worked out yet and cannot be applied directly to the training form of MLA; it is offered here for reference.
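To make the last equation concrete, here is a small NumPy sketch of this MQA-style inference form (a personal illustration, not DeepSeek's MLA implementation; it reuses the rope_matrix helper defined earlier, with \boldsymbol{c}_j as the shared Key/Value vector). Only the rotated \boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j would need to be cached, and the check confirms that the result depends only on relative positions:

```python
import numpy as np  # rope_matrix is assumed to be defined as in the earlier sketch

def shared_c_attention(i_pos, positions, q, C, d):
    """MQA-style step: keys and values share the rotated vectors R_j c_j,
    logits are (R_i q)^T (R_j c_j), and the output is counter-rotated by R_i^T."""
    rotated_c = np.stack([rope_matrix(p, d) @ c for p, c in zip(positions, C)])
    logits = rotated_c @ (rope_matrix(i_pos, d) @ q)
    a = np.exp(logits - logits.max())
    a /= a.sum()                                  # softmax over cached tokens
    return rope_matrix(i_pos, d).T @ (a @ rotated_c)

d, n = 8, 6
rng = np.random.default_rng(2)
q, C = rng.normal(size=d), rng.normal(size=(n, d))

pos = np.arange(n)
o_a = shared_c_attention(5, pos, q, C, d)             # query at the last position
o_b = shared_c_attention(5 + 37, pos + 37, q, C, d)   # same offsets, shifted start
print(np.allclose(o_a, o_b))  # True: relative positional encoding is preserved
```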
Summary
This article centered on the question "Can RoPE be added to V?" and introduced a second way of using RoPE, namely VO-RoPE.