Readers who have been following the "Transformer Upgrade Road" series up to this point are likely already familiar with Rotary Positional Encoding (RoPE). Simply put, RoPE is a rotary transformation applied to the Query (\boldsymbol{Q}) and Key (\boldsymbol{K}) of the Attention mechanism. Formally, it belongs to the category of absolute positional encoding, but when combined with the dot-product property of Attention, it automatically achieves a relative positional effect.
So, can RoPE be applied to the Value (\boldsymbol{V})? At first glance, it seems not, because rotating \boldsymbol{V} does not result in relative positional encoding. However, things are not quite so absolute. In this article, we will discuss applying RoPE to \boldsymbol{V}, which we can call the "Second Type of Rotary Positional Encoding."
Background Review
We can decompose Dot-Product Attention as: \begin{equation} \boldsymbol{o}_i = \sum_j a_{i,j}\boldsymbol{v}_j,\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum\limits_j e^{s_{i,j}}},\qquad s_{i,j} = \boldsymbol{q}_i^{\top}\boldsymbol{k}_j \end{equation} For simplicity, the scaling factor for s_{i,j} is omitted here. RoPE is applied to \boldsymbol{q}_i and \boldsymbol{k}_j: \begin{equation} \boldsymbol{q}_i \to \boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i,\qquad \boldsymbol{k}_j \to \boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j \end{equation} This causes the Attention Logits, s_{i,j}, to become: \begin{equation} s_{i,j} = (\boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i)^{\top} (\boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j) = \boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j=\boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{k}_j \end{equation} In other words, s_{i,j} only depends on the relative position j-i, thereby achieving a relative positional effect through an absolute positional form. This transformation process utilizes the property of rotation matrices: \boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j=\boldsymbol{\mathcal{R}}_{j-i}.
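As a concrete sanity check, here is a minimal NumPy sketch (an illustration rather than the article's actual implementation; the even head dimension and the base 10000 are the usual RoPE defaults, assumed here) that builds the block-diagonal \boldsymbol{\mathcal{R}}_i and verifies numerically that the rotated dot product depends only on j-i:

```python
import numpy as np

def rope_matrix(pos, d, base=10000.0):
    """Block-diagonal RoPE rotation matrix R_pos for an even head dimension d."""
    thetas = base ** (-np.arange(0, d, 2) / d)          # per-block frequencies
    R = np.zeros((d, d))
    for k, theta in enumerate(thetas):
        c, s = np.cos(pos * theta), np.sin(pos * theta)
        R[2*k:2*k+2, 2*k:2*k+2] = [[c, -s], [s, c]]     # 2x2 rotation block
    return R

d = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# The same relative offset j - i = 4 at two different absolute positions
s1 = (rope_matrix(3, d) @ q) @ (rope_matrix(7, d) @ k)    # i = 3,  j = 7
s2 = (rope_matrix(10, d) @ q) @ (rope_matrix(14, d) @ k)  # i = 10, j = 14
print(np.allclose(s1, s2))  # True: the logit depends only on j - i
```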
In addition to rotation matrices, in "Transformer Upgrade Road: 4, Rotary Positional Encoding for 2D Positions", we proved that the general solution is \boldsymbol{\mathcal{R}}_i = \boldsymbol{O}^i, where \boldsymbol{O} is any orthogonal matrix and the superscript denotes the i-th matrix power. However, we later showed in "Transformer Upgrade Road: 6, Completeness Analysis of Rotary Positional Encoding" that such general orthogonal-matrix solutions are essentially isomorphic to the rotation-matrix solution.
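For intuition, here is a standard two-dimensional illustration (an elementary fact, not a result taken from the cited posts): if \boldsymbol{O} is a 2D rotation by an angle \theta, then the matrix power simply accumulates the angle, \begin{equation} \boldsymbol{O} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \quad\Longrightarrow\quad \boldsymbol{O}^i = \begin{pmatrix} \cos i\theta & -\sin i\theta \\ \sin i\theta & \cos i\theta \end{pmatrix} \end{equation} which is precisely one 2\times 2 block of the usual \boldsymbol{\mathcal{R}}_i.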
New Usage
What if we apply RoPE to \boldsymbol{v}_j, i.e., \boldsymbol{v}_j \to \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j? Clearly, the result of the Attention is: \begin{equation} \boldsymbol{o}_i = \sum_j a_{i,j} \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j \label{eq:v-rope-abs} \end{equation} This makes the Attention output depend explicitly on the absolute position j. If all we wanted were some form of positional encoding, this might not be a problem; but if we specifically want relative positional encoding, it falls short of the goal.
However, there is a simple trick that fixes this flaw! We can apply an inverse RoPE to \boldsymbol{o}_i: \begin{equation} \boldsymbol{o}_i = \boldsymbol{\mathcal{R}}_i^{\top}\left(\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j\right)=\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j=\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{v}_j \label{eq:vo-rope} \end{equation} In this way, it becomes a relative positional encoding once again! Formally, it too is built from two absolute positional transformations, just like the existing RoPE. We therefore call it the "Second Type of Rotary Positional Encoding," or more intuitively "VO-RoPE," since it applies RoPE to both the Value and the Output. Correspondingly, standard RoPE can be called "QK-RoPE."
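Below is a minimal NumPy sketch of this identity (an illustration, not the article's experiments; it reuses the rope_matrix helper from the sketch above and treats the attention weights a_{i,j} as given): rotating each value by \boldsymbol{\mathcal{R}}_j and counter-rotating the output by \boldsymbol{\mathcal{R}}_i^{\top} makes the result depend only on the offsets j-i, whereas V-RoPE alone does not.

```python
import numpy as np  # rope_matrix is assumed to be defined as in the sketch above

def vo_rope_output(i, positions, a, V, d):
    """o_i = R_i^T * sum_j a_j * R_{p_j} v_j, with the attention weights a given."""
    out = sum(a_j * (rope_matrix(p_j, d) @ v_j)
              for a_j, p_j, v_j in zip(a, positions, V))
    return rope_matrix(i, d).T @ out

d, n = 8, 5
rng = np.random.default_rng(1)
V = rng.normal(size=(n, d))
a = rng.dirichlet(np.ones(n))               # some fixed attention distribution

pos = np.arange(n)                          # keys at positions 0..4, query at i = 2
o_a = vo_rope_output(2, pos, a, V, d)
o_b = vo_rope_output(2 + 100, pos + 100, a, V, d)   # shift all positions by 100
print(np.allclose(o_a, o_b))                # True: only the offsets j - i matter

# By contrast, V-RoPE alone (no counter-rotation of the output) is absolute:
v_a = sum(a_j * (rope_matrix(p, d) @ v) for a_j, p, v in zip(a, pos, V))
v_b = sum(a_j * (rope_matrix(p, d) @ v) for a_j, p, v in zip(a, pos + 100, V))
print(np.allclose(v_a, v_b))                # False: absolute positions leak in
```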
Simple Experiment
A quick round of experiments was conducted on a LLaMA-like model with approximately 1B parameters. The configurations compared were:
1. NoPE: No positional encoding at all;
2. QK-RoPE: Standard rotary positional encoding;
3. VO-RoPE: The second type of rotary positional encoding proposed in this article;
4. Q/K/V/O-RoPE: Applying rotary positional encoding individually to one of Q, K, V, or O;
5. QKV-RoPE: Applying rotary positional encoding to Q, K, and V;
6. QKVO-RoPE: Applying rotary positional encoding to Q, K, V, and O.
Note that configurations 4 and 5 are, strictly speaking, absolute positional encodings. The general conclusion (where "A > B" means A attains a lower training loss, i.e., performs better, than B) is: \text{QK-RoPE} \approx \text{QKVO-RoPE} > \text{K-RoPE} \approx \text{VO-RoPE} > \text{QKV-RoPE} > \text{NoPE} > \text{Q/V/O-RoPE}
The specific loss values are:
| Configuration | Loss |
|---|---|
| QK-RoPE | 2.712 |
| QKVO-RoPE | 2.719 |
| K-RoPE | 2.769 |
| VO-RoPE | 2.770 |
| QKV-RoPE | 2.783 |
| NoPE | 2.795 |
| O-RoPE | 2.841 |
| Q-RoPE | 2.851 |
| V-RoPE | 2.856 |
Some Thoughts
From the results above, it can be seen that VO-RoPE is superior to NoPE but inferior to QK-RoPE, and stacking VO-RoPE with QK-RoPE does not provide any gain. In this light, does VO-RoPE seem unnecessary?
In the author's view, completing the picture of how RoPE can be used, answering the question "Can RoPE be added to Value?", and establishing experimentally that "there is no significant gain" is valuable in itself. Moreover, in the long run it may not be entirely without benefit; it may simply fail to show its effect under today's mainstream language-model settings. When the author first proposed RoPE, the motivation was simply for fun, with no expectation that it would become a competitive positional encoding (what happened afterwards was a stroke of luck).
Currently, VO-RoPE has a potential application scenario related to MLA, which was introduced in "The Ultimate Tug-of-War Between Cache and Performance: From MHA, MQA, GQA to MLA". We know that at inference time, MLA is approximately equivalent to an MQA in which K and V are shared: \begin{equation} \boldsymbol{o}_i = \sum_{j=1}^i a_{i,j}\boldsymbol{c}_j,\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum\limits_{j=1}^i e^{s_{i,j}}},\qquad s_{i,j} = \boldsymbol{q}_i^{\top}\boldsymbol{c}_j \end{equation} This property allows its KV Cache to store only the single vector \boldsymbol{c}_j per token. However, this important feature is incompatible with QK-RoPE, because once RoPE is added to \boldsymbol{c}_j inside the Attention matrix, there are two possible outcomes:
1. If \boldsymbol{c}_j on the Value side does not carry RoPE, then K and V are no longer fully shared. This means either doubling the KV Cache (caching both the pre-RoPE and post-RoPE versions of \boldsymbol{c}_j) or applying RoPE to K on the fly (introducing extra latency);
2. If \boldsymbol{c}_j on the Value side does have RoPE, the K and V sharing effect is achieved, but it is no longer a relative positional encoding.
To solve this, MLA adopts a "mostly NoPE + a small RoPE part" concatenation approach. However, from the second type of rotary positional encoding discussed in this article, we know that, on top of option 2 (where \boldsymbol{c}_j already carries RoPE on both the Key and Value sides), we only need to additionally apply an O-RoPE, i.e., an inverse rotation, to the Output: \begin{equation} \boldsymbol{o}_i = \boldsymbol{\mathcal{R}}_i^{\top}\sum_{j=1}^i a_{i,j}(\boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j),\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum\limits_{j=1}^i e^{s_{i,j}}},\qquad s_{i,j} = (\boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i)^{\top} (\boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j) \end{equation} That said, this idea has not been fully worked out yet and cannot be applied directly to the training form of MLA; it is offered here for reference.
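To make the last equation concrete, here is a small NumPy sketch of this MQA-style inference form (a personal illustration, not DeepSeek's MLA implementation; it reuses the rope_matrix helper defined earlier, with \boldsymbol{c}_j as the shared Key/Value vector). Only the rotated \boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j would need to be cached, and the check confirms that the result depends only on relative positions:

```python
import numpy as np  # rope_matrix is assumed to be defined as in the earlier sketch

def shared_c_attention(i_pos, positions, q, C, d):
    """MQA-style step: keys and values share the rotated vectors R_j c_j,
    logits are (R_i q)^T (R_j c_j), and the output is counter-rotated by R_i^T."""
    rotated_c = np.stack([rope_matrix(p, d) @ c for p, c in zip(positions, C)])
    logits = rotated_c @ (rope_matrix(i_pos, d) @ q)
    a = np.exp(logits - logits.max())
    a /= a.sum()                                  # softmax over cached tokens
    return rope_matrix(i_pos, d).T @ (a @ rotated_c)

d, n = 8, 6
rng = np.random.default_rng(2)
q, C = rng.normal(size=d), rng.normal(size=(n, d))

pos = np.arange(n)
o_a = shared_c_attention(5, pos, q, C, d)             # query at the last position
o_b = shared_c_attention(5 + 37, pos + 37, q, C, d)   # same offsets, shifted start
print(np.allclose(o_a, o_b))  # True: relative positional encoding is preserved
```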
Summary
This article centered on the question "Can RoPE be added to V?" and introduced a second way of using RoPE, namely VO-RoPE.