After introducing mixed-radix positions to further generalize NTK-aware Scaled RoPE in "Transformer Upgrade Road: 11. Taking the \beta-radix Position to the End", I felt that approaches along those lines had reached their limit. Achieving a more substantial improvement would require a different path. At that point, I recalled an idea I had conceived earlier but set aside because of its complexity. Having now hit a bottleneck, with no better option remaining, I decided to revisit it.
To my surprise, although the method does increase inference cost, its experimental results are remarkably good: they even hint at the capacity to extrapolate to unlimited length! I am therefore eager to share it in this article. Owing to its formal similarity to the ReLU activation function, I have named the method "ReRoPE (Rectified Rotary Position Embeddings)."
Review
We know that RoPE is formally an absolute position encoding, but in practice, it brings relative position information to the Attention mechanism, specifically the following Toeplitz matrix: \begin{equation} \begin{pmatrix} 0 & \\ 1 & 0 & \\ 2 & 1 & 0 &\\ 3 & 2 & 1 & 0 & \\ \ddots & 3 & 2 & 1 & 0 & \\ \ddots & \ddots & 3 & 2 & 1 & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \text{\small $L - 2$} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \text{\small $L - 1$} & \text{\small $L - 2$} & \ddots & \ddots & \ddots & 3 & 2 & 1 & 0 & \\ \end{pmatrix} \label{eq:rope} \end{equation} Here L is the current sample length. When L significantly exceeds the training length, the extra positions have not been trained, so their performance cannot be guaranteed. This is why direct length extrapolation usually performs poorly.
Later, researchers proposed Position Interpolation (PI), which is equivalent to changing the relative position matrix to: \begin{equation} \begin{pmatrix} 0 & \\ \frac{1}{k} & 0 & \\ \frac{2}{k} & \frac{1}{k} & 0 &\\ \frac{3}{k} & \frac{2}{k} & \frac{1}{k} & 0 & \\ \ddots & \frac{3}{k} & \frac{2}{k} & \frac{1}{k} & 0 & \\ \ddots & \ddots & \frac{3}{k} & \frac{2}{k} & \frac{1}{k} & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \text{\small $\frac{L-2}{k}$} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \text{\small $\frac{L-1}{k}$} & \text{\small $\frac{L-2}{k}$} & \ddots & \ddots & \ddots & \frac{3}{k} & \frac{2}{k} & \frac{1}{k} & 0 & \\ \end{pmatrix} \end{equation} In this way, by adjusting k, we can ensure that the maximum relative position does not exceed the training length, thus avoiding extrapolation. However, it makes the position information more "crowded," so a certain number of fine-tuning steps are required before the model works properly again. Because it avoids extrapolation, the number of fine-tuning steps required is much smaller than that for direct extrapolation (neural networks tend to be better at interpolation than at extrapolation).
As for the subsequently proposed NTK-aware Scaled RoPE, it takes a "clever" route by spreading the extrapolation pressure across every dimension. It therefore performs well even without fine-tuning, but it ultimately still relies on extrapolation, which neural networks handle poorly, so its performance has a ceiling: in my experiments, its long-context performance never came close to the within-training-length results.
Fusion
We can also examine these methods from the perspective of the locality of language models. Locality refers to the fact that when a language model predicts the next token, it clearly depends more on neighboring tokens. Direct extrapolation maintains locality (position encodings near 0 remain unchanged), but performs poorly because it introduces position encodings beyond the training length. Position interpolation avoids extrapolating position encodings but disrupts locality (position encodings near 0 are compressed to 1/k), so it does not perform well without fine-tuning. NTK-aware Scaled RoPE implicitly combines the advantages of both through "high-frequency extrapolation, low-frequency interpolation," ensuring locality without significant extrapolation of position encodings, thus yielding good results without fine-tuning.
Is there a way to combine extrapolation and interpolation more directly? Yes. We can set a window size w. Within the window, we use a position interval of size 1; outside the window, we use a position interval of size 1/k. The entire relative position matrix is as follows: \begin{equation} \begin{pmatrix} \color{red}{0} & \\ \color{red}{1} & \color{red}{0} & \\ \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\text{\small $w + \frac{1}{k}$}} & \color{green}{w} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\text{\small $w + \frac{2}{k}$}} & \color{green}{\text{\small $w + \frac{1}{k}$}} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\text{\small $w + \frac{2}{k}$}} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\text{\small $w + \frac{2}{k}$}} & \color{green}{\text{\small $w + \frac{1}{k}$}} & \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\text{\small $w + \frac{L-1-w}{k}$}} & \color{green}{\ddots} & \color{green}{\ddots} & 
\color{green}{\ddots} & \color{green}{\text{\small $w + \frac{2}{k}$}} & \color{green}{\text{\small $w + \frac{1}{k}$}} & \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \end{pmatrix} \label{eq:leaky-rerope} \end{equation} As long as w is smaller than the training length, by controlling k, we can ensure that all position encodings do not exceed the training length while precisely maintaining locality, simply and directly combining direct extrapolation and position interpolation.
Specifically, matrix [eq:leaky-rerope] has a special case: when k \to \infty, it simplifies to: \begin{equation} \begin{pmatrix} \color{red}{0} & \\ \color{red}{1} & \color{red}{0} & \\ \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{green}{w} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{green}{w} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{w} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w} & \color{green}{w} & \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w} & \color{green}{w} & \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \end{pmatrix} \label{eq:rerope} \end{equation} In this case, no matter the input length, the range of its position encodings does not exceed w. 
Therefore, this is a scheme that potentially supports Context of arbitrary length!
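To make the piecewise scheme concrete, here is a minimal NumPy sketch that builds the relative position matrix of [eq:leaky-rerope], with k = inf recovering the ReRoPE matrix [eq:rerope]. The function name rerope_positions is my own, not from any released implementation:

```python
import numpy as np

def rerope_positions(L, w, k=np.inf):
    """Relative position matrix for Leaky ReRoPE; k = inf gives ReRoPE.

    Offsets i - j below the window w are kept as-is; beyond it they
    advance at the slower rate 1/k (or are clipped at w when k = inf).
    """
    d = np.arange(L)[:, None] - np.arange(L)[None, :]  # i - j
    d = np.maximum(d, 0).astype(float)                 # causal: keep lower triangle
    if np.isinf(k):
        outside = np.full_like(d, float(w))            # clipped at w
    else:
        outside = w + (d - w) / k                      # slope 1/k beyond the window
    return np.where(d < w, d, outside)
```

For example, with L = 6 and w = 3, entry (5, 0) is 3 under ReRoPE but 3 + 2/k under Leaky ReRoPE.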
Formally, the relationship between matrices [eq:rerope], [eq:leaky-rerope] and the standard RoPE matrix [eq:rope] is analogous to the relationship between ReLU, Leaky ReLU, and Linear functions. Thus, I call [eq:rerope] "ReRoPE (Rectified RoPE)" and [eq:leaky-rerope] "Leaky ReRoPE."
Calculation
In fact, similar ideas are not difficult to conceive. Relative position encodings based on Attention Bias (such as classic relative position encoding or T5 position encoding) often feature such block-wise operations. However, unlike those, implementing such block-wise operations in RoPE significantly increases the computational load, which is the main reason I had previously set this idea aside.
How should we understand the increased computational load? We know that RoPE "achieves relative position through absolute position," which only yields relative positions that are linear in the offset. Matrices [eq:leaky-rerope] and [eq:rerope] are non-linear (more precisely, piecewise linear). To implement them, one must calculate the Attention matrix twice and then merge the results. Specifically, first calculate the Attention matrix using standard RoPE (before Softmax): \begin{equation} a_{i,j}^{(1)} = \left(\boldsymbol{\mathcal{R}}^i\boldsymbol{q}_i\right)^{\top}\left(\boldsymbol{\mathcal{R}}^j\boldsymbol{k}_j\right) = \boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}^{j-i}\boldsymbol{k}_j \end{equation} Here the first equality is the implementation method, and the second is the equivalent result, where \boldsymbol{\mathcal{R}} is the rotation matrix of RoPE. For simplicity, we omit the Attention scale factor. Next, we need to calculate the Attention matrix for RoPE with an interval of 1/k (Leaky ReRoPE): \begin{equation} a_{i,j}^{(2)} = \left(\boldsymbol{\mathcal{R}}^{(i-w)/k+w}\boldsymbol{q}_i\right)^{\top}\left(\boldsymbol{\mathcal{R}}^{j/k}\boldsymbol{k}_j\right) = \boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}^{(j-i+w)/k-w}\boldsymbol{k}_j \end{equation} If it is ReRoPE, it is simpler: \begin{equation} a_{i,j}^{(2)} = \left(\boldsymbol{\mathcal{R}}^w\boldsymbol{q}_i\right)^{\top}\boldsymbol{k}_j = \boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}^{-w}\boldsymbol{k}_j \end{equation} Finally, merge them based on the condition i - j < w: \begin{equation} a_{i,j} = \left\{\begin{aligned} &a_{i,j}^{(1)},\quad (i - j < w) \\[8pt] &a_{i,j}^{(2)}, \quad (i - j \geq w) \end{aligned}\right. \end{equation} Whether it is ReRoPE or Leaky ReRoPE, calculating the Attention matrix twice is unavoidable (if there is a more efficient implementation, please let me know). This is one source of the increased computation.
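As an unoptimized sketch of this two-pass computation (the helper names rope_rotate and rerope_scores are my own, not from the released implementation), the following NumPy code builds a^{(1)} with standard RoPE and a^{(2)} with the fixed query rotation, then merges them on the condition i - j < w:

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate each row of x (seq, dim) by its position, with the standard RoPE pairing."""
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    theta = positions[:, None] * freqs[None, :]        # (seq, dim/2)
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def rerope_scores(q, k, w):
    """Pre-softmax ReRoPE scores, merging the two attention matrices."""
    L = q.shape[0]
    pos = np.arange(L, dtype=float)
    a1 = rope_rotate(q, pos) @ rope_rotate(k, pos).T   # standard RoPE scores
    a2 = rope_rotate(q, np.full(L, float(w))) @ k.T    # (R^w q_i)^T k_j
    i, j = np.indices((L, L))
    return np.where(i - j < w, a1, a2)
```

A sanity check: with identical keys, all scores at offsets beyond the window coincide, since those positions are clipped to the same value.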
Furthermore, the need for custom Attention matrix calculation means that existing flash attention implementations cannot be directly applied, further increasing the computational cost.
On the other hand, also due to the non-linear relative positions, the Key sequence cache during auto-regressive decoding can only store values before RoPE. Then, at each decoding step, the corresponding RoPE must be applied to the entire Key sequence. This change also increases inference computation. The only good news is that during token-by-token decoding, starting from the second step, the Query sequence length is 1. At this point, we only need to customize the RoPE for the Key sequence, so the Attention matrix only needs to be calculated once: \begin{equation} a_{i,j} = \left\{\begin{aligned} &\boldsymbol{q}_i^{\top}\left(\boldsymbol{\mathcal{R}}^{\max(j-i,-w)}\boldsymbol{k}_j\right), \quad(\text{ReRoPE})\\[8pt] &\boldsymbol{q}_i^{\top}\left(\boldsymbol{\mathcal{R}}^{\max(j-i,(j-i+w)/k-w)}\boldsymbol{k}_j\right), \quad(\text{Leaky ReRoPE}) \end{aligned}\right. \end{equation}
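That decoding-step formula can be sketched as follows, rotating only the key sequence by the clipped exponent; the complex-number view of the RoPE pairs is for brevity, and the function name is my own:

```python
import numpy as np

def rerope_decode_scores(q, keys, w, k=np.inf, base=10000.0):
    """Scores q_i^T R^{max(j-i, -w)} k_j for the newest query (i = len(keys) - 1).

    k = inf gives ReRoPE; a finite k gives the Leaky ReRoPE exponent
    max(j - i, (j - i + w)/k - w). The full key sequence is re-rotated each step.
    """
    i = len(keys) - 1
    j = np.arange(len(keys), dtype=float)
    if np.isinf(k):
        exp = np.maximum(j - i, -w)
    else:
        exp = np.maximum(j - i, (j - i + w) / k - w)
    dim = q.shape[-1]
    freqs = base ** (-np.arange(dim // 2) * 2.0 / dim)
    rot = np.exp(1j * exp[:, None] * freqs)            # per-key rotation angles
    kz = (keys[:, 0::2] + 1j * keys[:, 1::2]) * rot    # rotated keys
    qz = q[0::2] + 1j * q[1::2]
    return (np.conj(qz) * kz).real.sum(-1)             # q^T R^{exp_j} k_j
```

The newest key (j = i) gets a zero rotation, so its score reduces to a plain dot product, while all keys beyond the window share the clipped exponent -w.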
Experiments
Continuing with the settings from "Transformer Upgrade Road: 11. Taking the \beta-radix Position to the End", we conducted experiments on ReRoPE. The results are shown in the table below:
| Test Length | 512 (Train) | 4096 (Repeat) | 4096 (Non-repeat) |
|---|---|---|---|
| Baseline | 49.41% | 24.17% | 23.16% |
| Baseline-\log n | 49.40% | 24.60% | 24.02% |
| PI-RoPE | 49.41% | 15.04% | 13.54% |
| PI-RoPE-\log n | 49.40% | 14.99% | 16.51% |
| NTK-RoPE-old | 49.41% | 51.28% | 39.27% |
| NTK-RoPE-\log n-old | 49.40% | 61.71% | 43.75% |
| NTK-RoPE-fixed | 49.41% | 51.86% | 39.61% |
| NTK-RoPE-\log n^{\textcolor{red}{\dagger}}-fixed | 49.41% | 55.94% | 41.11% |
| NTK-RoPE-\log n-fixed | 49.40% | 62.85% | 44.14% |
| NTK-RoPE-mixed | 49.41% | 53.09% | 40.12% |
| NTK-RoPE-\log n^{\textcolor{red}{\dagger}}-mixed | 49.41% | 59.11% | 42.38% |
| NTK-RoPE-\log n-mixed | 49.40% | 68.91% | 45.41% |
| ReRoPE-w256 | 49.41% | 77.90% | 48.48% |
| ReRoPE-w256-\log n^{\textcolor{red}{\dagger}} | 49.41% | 82.40% | 48.85% |
| ReRoPE-w256-\log n | 49.40% | 85.12% | 49.07% |
| HFWA | 48.70% | 80.84% | 48.15% |
As mentioned at the beginning, the extrapolation effect of ReRoPE without fine-tuning is surprisingly good. It not only significantly surpasses the previously optimal NTK-RoPE-mixed but also clearly exceeds HFWA, which was pre-trained from scratch! Here, \text{w256} refers to w=256. \log n^{\textcolor{red}{\dagger}} refers to cases where \log n scaling was not included during pre-training (like Llama), but during testing, each \boldsymbol{q}_n is multiplied by \max(1, \log_{\text{maxlen}} n). \log n refers to cases where the \log n scaling factor was included during pre-training.
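For concreteness, the test-time factor described above can be sketched as follows (the helper name logn_scale and its argument names are my own):

```python
import math

def logn_scale(n, train_len):
    """max(1, log_{train_len} n): the factor applied to q_n at test time when
    the model was pre-trained without the log n attention scale (dagger variants)."""
    return max(1.0, math.log(n) / math.log(train_len))
```

Within the training length the factor stays at 1, so training-length behavior is unchanged; beyond it, queries are amplified logarithmically.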
Below are some ablation experiments showing that ReRoPE is quite robust regarding w. The optimal value is roughly between 1/4 and 1/2 of the training length:
| Test Length | 512 (Train) | 4096 (Repeat) | 4096 (Non-repeat) |
|---|---|---|---|
| ReRoPE-w64 | 49.41% | 69.39% | 45.19% |
| ReRoPE-w64-\log n^{\textcolor{red}{\dagger}} | 49.41% | 78.58% | 47.42% |
| ReRoPE-w64-\log n | 49.40% | 84.38% | 48.14% |
| ReRoPE-w128 | 49.41% | 76.11% | 47.82% |
| ReRoPE-w128-\log n^{\textcolor{red}{\dagger}} | 49.41% | 82.28% | 48.78% |
| ReRoPE-w128-\log n | 49.40% | 85.47% | 48.87% |
| ReRoPE-w256 | 49.41% | 77.90% | 48.48% |
| ReRoPE-w256-\log n^{\textcolor{red}{\dagger}} | 49.41% | 82.40% | 48.85% |
| ReRoPE-w256-\log n | 49.40% | 85.12% | 49.07% |
| ReRoPE-w384 | 49.41% | 70.72% | 48.15% |
| ReRoPE-w384-\log n^{\textcolor{red}{\dagger}} | 49.41% | 76.42% | 48.31% |
| ReRoPE-w384-\log n | 49.40% | 83.24% | 48.62% |
| ReRoPE-w512 | 49.41% | 7.09% | 8.25% |
| ReRoPE-w512-\log n^{\textcolor{red}{\dagger}} | 49.41% | 7.08% | 8.25% |
| ReRoPE-w512-\log n | 49.40% | 15.84% | 10.83% |
The following table compares ReRoPE and Leaky ReRoPE:
| Test Length | 512 (Train) | 4096 (Repeat) | 4096 (Non-repeat) |
|---|---|---|---|
| ReRoPE-w128-\log n | 49.40% | 85.47% | 48.87% |
| Leaky ReRoPE-w128-k64-\log n | 49.40% | 85.29% | 48.96% |
| Leaky ReRoPE-w128-k32-\log n | 49.40% | 85.31% | 49.03% |
| Leaky ReRoPE-w128-k16-\log n | 49.40% | 85.15% | 49.10% |
| Leaky ReRoPE-w128-k8-\log n | 49.40% | 80.00% | 48.11% |
| ReRoPE-w256-\log n | 49.40% | 85.12% | 49.07% |
| Leaky ReRoPE-w256-k64-\log n | 49.40% | 84.60% | 49.03% |
| Leaky ReRoPE-w256-k32-\log n | 49.40% | 84.30% | 48.97% |
| Leaky ReRoPE-w256-k16-\log n | 49.40% | 83.59% | 48.87% |
| Leaky ReRoPE-w256-k8-\log n | 49.40% | 69.80% | 45.72% |
As a generalization of ReRoPE, a carefully tuned Leaky ReRoPE has the potential to outperform ReRoPE, but the improvement is very slight. Furthermore, when k is finite, the maximum length it can handle is also finite; since we cannot know the total generation length in advance, we can only preset a sufficiently large k, and once the input grows long enough, performance drops sharply because the position encodings exceed the training length. ReRoPE carries no such risk. Overall, the value of tuning Leaky ReRoPE instead of simply using ReRoPE seems small.
The above experimental results were all tested on a GAU model with 100 million parameters. Below are the test results based on Llama2-13b (the metric is loss, lower is better), representing performance on a real LLM:
| Test Length | 4096 (Train) | 8192 | 16384 |
|---|---|---|---|
| RoPE | 1.4967 | 8.8615 | - |
| NTK-RoPE | 1.6081 | 1.5417 | 1.5163 |
| ReRoPE | 1.4996 | 1.4267 | 1.4001 |
As can be seen, ReRoPE truly achieves almost no loss in training performance (RoPE-4096 represents the training effect) and satisfies the ideal characteristic of "longer context, lower loss" (more context should be more helpful for prediction). Additionally, I tested the chat performance on the Llama2-13b fine-tuned model open-sourced by OpenBuddy, and it felt quite good (I tested up to 20k tokens of Context).
Finally, I am sharing code that implements ReRoPE and Leaky ReRoPE for the Llama model in the transformers library; readers can load Llama-series models and test it themselves:
Github: https://github.com/bojone/rerope
Summary
In this article, I proposed ReRoPE (Rectified RoPE), which is likewise a post-processing scheme for RoPE. Experimental results show that its length extrapolation capability without fine-tuning not only significantly exceeds the previous NTK-aware Scaled RoPE but even surpasses the specially designed HFWA, which requires training from scratch. Furthermore, unlike NTK-aware Scaled RoPE, whose capability drops sharply after exceeding a certain length, ReRoPE seems to perform well at any length. In addition to the comparative experiments, the article provides a reference implementation based on transformers-llama for interested readers to test.
Original address: https://kexue.fm/archives/9708