English (unofficial) translations of posts at kexue.fm

Transformer Upgrade Road: 13. Inverse Leaky ReRoPE

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

Last week, in "Transformer Upgrade Road: 12. ReRoPE for Infinite Extrapolation?", I proposed ReRoPE and Leaky ReRoPE. Numerous experimental results showed that they can extend the context length of Large Language Models (LLMs) without fine-tuning and with almost no loss in training performance. They also achieved the ideal characteristic of "longer context, lower loss." Furthermore, unlike NTK-aware Scaled RoPE, ReRoPE even demonstrated an apparent capacity for infinite context processing.

In short, ReRoPE seems quite satisfactory. However, its main drawback is the increased inference cost, specifically manifested as the need to calculate Attention twice in the first step of inference and the need to recalculate position embeddings in each subsequent step. This article attempts to solve this problem by "inverting" the use of Leaky ReRoPE during the training phase.

Review

Let us revisit the concepts once more: RoPE is formally an absolute position encoding, but its practical effect is that of a relative position encoding. The corresponding relative position matrix is: \begin{equation} \begin{pmatrix} 0 & \\ 1 & 0 & \\ 2 & 1 & 0 &\\ 3 & 2 & 1 & 0 & \\ \ddots & 3 & 2 & 1 & 0 & \\ \ddots & \ddots & 3 & 2 & 1 & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \text{\small $L - 2$} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \text{\small $L - 1$} & \text{\small $L - 2$} & \ddots & \ddots & \ddots & 3 & 2 & 1 & 0 & \\ \end{pmatrix} \label{eq:rope} \end{equation}
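For concreteness, this lower-triangular matrix can be generated directly; a minimal numpy sketch (the function name is my own):

```python
import numpy as np

def rope_relative_positions(L):
    """Relative position matrix implied by RoPE: entry (i, j) holds
    i - j for j <= i, i.e. the causal lower triangle shown above."""
    idx = np.arange(L)
    return np.tril(idx[:, None] - idx[None, :])
```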

To maintain locality while avoiding the position out-of-bounds problem caused by long contexts, Leaky ReRoPE modifies the relative position matrix during the inference phase to: \begin{equation} \small \addtolength{\arraycolsep}{-3pt} \begin{pmatrix} \color{red}{0} & \\ \color{red}{1} & \color{red}{0} & \\ \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\text{\small $w + \frac{1}{k}$}} & \color{green}{w} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\text{\small $w + \frac{2}{k}$}} & \color{green}{\text{\small $w + \frac{1}{k}$}} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\text{\small $w + \frac{2}{k}$}} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\text{\small $w + \frac{2}{k}$}} & \color{green}{\text{\small $w + \frac{1}{k}$}} & \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\text{\small $w + \frac{L-1-w}{k}$}} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\text{\small $w + \frac{2}{k}$}} & 
\color{green}{\text{\small $w + \frac{1}{k}$}} & \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \end{pmatrix} \label{eq:leaky-rerope} \end{equation} where w is the window width (usually taken as 1/4 to 1/2 of the training length), and k is used to adjust the maximum processable length. It is generally better to ensure that w + \frac{L-1-w}{k} does not exceed half of the training length. As for ReRoPE, it is simply the limit as k \to \infty: \begin{equation} \small \addtolength{\arraycolsep}{-3pt} \begin{pmatrix} \color{red}{0} & \\ \color{red}{1} & \color{red}{0} & \\ \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{green}{w} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{green}{w} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{w} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w} & \color{green}{w} & \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ 
\color{green}{w} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w} & \color{green}{w} & \color{green}{w} & \color{red}{\text{\small $w - 1$}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \end{pmatrix} \label{eq:rerope} \end{equation}
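Both matrices differ from the RoPE matrix only through an elementwise map of the relative position: the identity inside the window, slope \frac{1}{k} outside it, and clipping at w in the ReRoPE limit. A sketch of that map (names are my own choosing):

```python
import numpy as np

def leaky_rerope_map(rel, w, k):
    """Map a raw RoPE relative position to its Leaky ReRoPE value:
    unchanged inside the window w, stepped by 1/k beyond it.
    Passing k = np.inf recovers ReRoPE (positions clipped at w)."""
    rel = np.asarray(rel, dtype=float)
    return np.where(rel <= w, rel, w + (rel - w) / k)
```

For example, with w = 4 and k = 2, a raw relative position of 6 maps to 4 + 2/2 = 5.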

Inversion

From the evaluation results of the previous article, as a training-free extrapolation scheme, both ReRoPE and Leaky ReRoPE are quite satisfactory. They do not lose performance within the training length and achieve "Longer Context, Lower Loss." The only drawback is that their inference speed is slower compared to the original Attention, and they are currently incompatible with acceleration techniques like Flash Attention.

So, can we do the opposite? ReRoPE/Leaky ReRoPE uses normal-speed RoPE during the training phase and slows down during the inference phase. Conversely: can we make the training phase slower so that the inference phase uses conventional RoPE? Some readers might wonder: why would we want to slow down training? Wouldn't that raise the training cost? The point is that ReRoPE/Leaky ReRoPE is a length extrapolation method whose scenario is "Train Short, Test Long": the training slowdown is one-off and controllable, whereas an inference slowdown is paid on every call and is much harder to bear. Therefore, if the degree of slowdown is similar, we would rather move the slower part to the training phase.

Let’s look at Leaky ReRoPE again. In the training phase, its relative position matrix is Equation [eq:rope] with a step size of 1. In the inference phase, it uses a step size of 1 within the window w and a step size of \frac{1}{k} < 1 outside the window, as in Equation [eq:leaky-rerope]. In other words, the difference is that a smaller step size is used outside the window during inference. If we reverse this and use Leaky ReRoPE during the training phase with a step size outside the window greater than 1, then according to the principle of "using a smaller step size outside the window during inference," could the inference phase use a step size equal to 1 outside the window, thereby degrading back to standard RoPE?
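Under this inversion, the largest relative position actually exercised during training bounds how far plain-RoPE inference can go while staying in-distribution. A back-of-the-envelope helper (my own framing, not from the post):

```python
def max_trained_rel_position(train_len, w, k):
    """Largest relative position seen when training with Leaky ReRoPE
    (window w, step 1/k outside the window, with k < 1 so the step
    exceeds 1). Plain-RoPE inference then reuses positions roughly
    up to this value without extrapolating."""
    return w + (train_len - 1 - w) / k
```

With the settings used in the experiments below (train_len = 512, w = 128, k = 1/16), this gives 128 + 383 * 16 = 6256.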

I call this idea "InvLeaky ReRoPE (Inverse Leaky ReRoPE)." Without further ado, let’s conduct experimental tests.

Experiments

Continuing the previous experimental combination of "GAU + Deep Norm + Tiger + Language Model," we use Leaky ReRoPE with k=1/16, w=128 during the training phase and normal RoPE during the inference phase. The test results are as follows:

| Test Length | 512 (Train) | 4096 (Repeated) | 4096 (Non-repeated) |
| --- | --- | --- | --- |
| Baseline | 49.41% | 24.17% | 23.16% |
| Baseline-\log n | 49.40% | 24.60% | 24.02% |
| NTK-RoPE-fixed | 49.41% | 51.86% | 39.61% |
| NTK-RoPE-\log n^{\color{red}{\dagger}}-fixed | 49.41% | 55.94% | 41.11% |
| NTK-RoPE-\log n-fixed | 49.40% | 62.85% | 44.14% |
| NTK-RoPE-mixed | 49.41% | 53.09% | 40.12% |
| NTK-RoPE-\log n^{\color{red}{\dagger}}-mixed | 49.41% | 59.11% | 42.38% |
| NTK-RoPE-\log n-mixed | 49.40% | 68.91% | 45.41% |
| ReRoPE-w256 | 49.41% | 77.90% | 48.48% |
| ReRoPE-w256-\log n^{\color{red}{\dagger}} | 49.41% | 82.40% | 48.85% |
| ReRoPE-w256-\log n | 49.40% | 85.12% | 49.07% |
| InvLeaky ReRoPE-w128-\log n | 49.38% | 82.25% | 48.32% |
| InvLeaky ReRoPE-w128-b8-\log n | 49.62% | 81.15% | 48.85% |
| HFWA | 48.70% | 80.84% | 48.15% |

In the table, b8 means the RoPE base frequency was changed from 10,000 to 80,000. As can be seen, although the "Leaky ReRoPE \to RoPE" approach of InvLeaky ReRoPE is not as effective as "RoPE \to ReRoPE/Leaky ReRoPE," it still outperforms HFWA. Since the inference phase uses conventional RoPE, existing acceleration techniques can be applied, making it quite competitive. Additionally, I performed some simple hyperparameter tuning for k, w, b, etc., and found that the optimal solution is basically the two combinations above: "k is set to the reciprocal of twice the expansion factor, w is set to 1/4 of the training length, and b can optionally be multiplied by the expansion factor."
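That heuristic can be written down directly; a small sketch (10,000 is the default RoPE base, and the function name is mine):

```python
def invleaky_hyperparams(train_len, expansion, base=10000):
    """Heuristic from the experiments: k is the reciprocal of twice the
    expansion factor, w is a quarter of the training length, and the
    RoPE base may optionally be scaled by the expansion factor."""
    k = 1.0 / (2 * expansion)
    w = train_len // 4
    b = base * expansion
    return k, w, b
```

For train_len = 512 with an 8x expansion (512 to 4096) this reproduces the settings above: k = 1/16, w = 128, and b = 80,000 for the b8 variant.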

So, how much does InvLeaky ReRoPE affect training speed? In the above experiments, the model has 100 million parameters and a training length of 512. The training time per 1000 steps increased from 330 seconds to 350 seconds, an increase of less than 10%. Of course, this is partly due to the use of GAU, which uses single-head attention and is inherently faster than multi-head attention. If multi-head attention is used or the training length is longer, the increase might be larger, but it is estimated that an increase of no more than 50% should be acceptable.

Summary

This article proposes the "inverse" use of Leaky ReRoPE. By using Leaky ReRoPE with a larger step size during the training phase, the inference phase can revert to conventional RoPE, thereby maintaining the same inference speed. Experimental results show that this approach is quite competitive.

Reprinting: Please include the original address of this article: https://kexue.fm/archives/9728

For more details on reprinting, please refer to: Scientific Space FAQ