English (unofficial) translations of posts at kexue.fm

Transformer Upgrade Road: 16. “Reviewing” Length Extrapolation Techniques

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate, please refer to the original post for important stuff.

Looking back, I realize that ever since the 7th article, “Transformer Upgrade Road: 7. Length Extrapolation and Local Attention,” this series has been “obsessed” with length extrapolation: nine consecutive articles (not counting this one) have revolved around the topic. Today, exactly one year after that 7th article, the open-source community has made significant progress on length extrapolation, and I have gradually formed my own understanding of it: for instance, the problem is far less simple than initially imagined, and many earlier works based on local attention are not always effective. This suggests that much of the older analytical work never touched the core of the problem.

In this article, I attempt to combine my findings and insights to “review” mainstream length extrapolation results and try to discover the key to training-free length extrapolation.

Problem Definition

As the name suggests, training-free length extrapolation means that no additional training with long sequence data is required. By training the model only on short sequence corpora, one obtains a model capable of processing and predicting long sequences—i.e., “Train Short, Test Long.” So, how do we judge if a model can be used for long sequences? The most basic indicator is that the model’s long-sequence Loss or PPL (Perplexity) does not explode. A more practical evaluation involves inputting a sufficiently long context, having the model predict the answer, and comparing it with the ground truth using metrics like BLEU or ROUGE. LongBench belongs to this category of benchmarks.

However, it is important to note that length extrapolation should not come at the cost of sacrificing long-range dependencies—otherwise, considering length extrapolation becomes meaningless; one might as well just truncate the text. This means that solutions relying on explicit truncation of long-range dependencies must be chosen carefully. Examples include ALiBi and most of the schemes listed in Part 7, as well as Linear RNNs with explicit decay. These schemes behave as local attention when the sequence length is large enough. Even if they achieve length extrapolation, there is a risk of insufficient long-range dependency, which needs to be weighed according to specific scenarios.

How do we determine if long-range dependency is preserved during length extrapolation? A rigorous approach is the evaluation scheme proposed at the end of “Transformer Upgrade Road: 12. ReRoPE for Infinite Extrapolation?”. Prepare a sufficiently long text, but for each model, only calculate metrics for the final segment of each sample, as shown in the figure below:

[Image Placeholder: An evaluation method focusing on long-range dependency]

For example, if the training length is 4K and we want to see the effect of extrapolating to 16K, we prepare a test set of 16K-token samples. For the 4K evaluation, we take the last 4K tokens of each sample and calculate metrics on them; for the 8K evaluation, we take the last 8K tokens but only calculate metrics on the final 4K tokens; for the 12K evaluation, we take the last 12K tokens but again only calculate metrics on the final 4K tokens, and so on. In this way, evaluations at different context lengths are all scored on the same segment of tokens; the only difference is how much context precedes that segment. If long-range dependency is effectively preserved, the metrics should improve as the context length increases.
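To make the protocol concrete, here is a minimal sketch of scoring only the shared final segment (my own illustration, assuming a Hugging Face-style causal LM that returns per-token logits; the helper name is hypothetical):

    import torch

    def nll_on_last_segment(model, token_ids, context_len, metric_len=4096):
        # token_ids: 1D LongTensor holding one long test sample (e.g. 16K tokens).
        # context_len: how much context this evaluation may see (4K, 8K, 12K, ...).
        # metric_len: the final segment that every evaluation is scored on (here 4K).
        ids = token_ids[-context_len:].unsqueeze(0)            # keep only the last context_len tokens
        with torch.no_grad():
            logits = model(ids).logits                         # [1, context_len, vocab]
        logp = torch.log_softmax(logits[:, :-1], dim=-1)       # prefix up to t predicts token t+1
        nll = -logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return nll[:, -metric_len:].mean()                     # score only the shared final segment

With a 16K test set, calling this with context_len = 4096, 8192, 12288, 16384 scores the same final 4K tokens under increasingly long contexts.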

Rotary Position Encoding

Having discussed evaluation, let’s return to methods. At the beginning of the article, I mentioned “older analytical work.” A major characteristic distinguishing the “new” from the “old” is that older works mostly tried to design new architectures or position encodings to achieve length extrapolation. In contrast, the “new” work of the past year mainly focuses on length extrapolation for Decoder-Only Transformer models using Rotary Positional Embedding (RoPE).

As a side note, why have most current LLMs chosen RoPE for position encoding? I believe there are several reasons:

1. RoPE does not have explicit long-range decay, which is crucial for models aiming for Long Context;

2. RoPE is a true position encoding. Through trigonometric functions of different frequencies, it effectively distinguishes between long-range and short-range, achieving an effect similar to hierarchical position encoding, which is also a key part of Long Context;

3. RoPE acts directly on Q and K and does not change the form of Attention, making it more compatible with Flash Attention and easier to scale up.

In contrast, methods like ALiBi and KERPLE, though sometimes called position encodings, are actually types of Attention Bias. They don’t contain much positional information and are not suitable for Encoders. They work for Decoders largely because the Decoder’s own lower-triangular mask already provides sufficient positional bias; the extra Attention Bias is just icing on the cake. Furthermore, they cannot effectively distinguish between long-range and short-range within a single head; instead, they require setting different decay factors for different heads. This also means they perform poorly when used with single-head attention (such as GAU).

This comparison of pros and cons may look like “a merchant praising his own goods,” but that is not the intent; I am simply sharing my viewpoint, since some readers have asked similar questions. As the proposer of RoPE, my understanding of it is not necessarily deeper than anyone else’s. After all, RoPE was originally proposed purely for fun; the thought at the time was that if it worked at all, that would be great, and if it could rival learnable absolute position encodings, that would be excellent news. So, given that its success was “unexpected,” it is also “within reason” that the author himself does not have a fully thorough understanding of it.

Window Truncation

I’ve strayed from the topic again. Simply put, the content of the previous two sections mainly intended to express: currently, RoPE seems sufficient for Long Context, so studying RoPE’s length extrapolation is valuable, and when choosing an extrapolation scheme, we should not sacrifice long-range dependency capabilities.

In the earliest discussion of length extrapolation on this site, Part 7, we judged length extrapolation to be an OOD (Out Of Distribution) problem at the prediction stage. Although some comments in that article seem a bit dated today, this fundamental judgment remains correct. In the context of RoPE, it means that unseen relative distances appear during the inference stage. To address this, a seemingly feasible solution is to introduce a Sliding Window Attention Mask, as shown on the left below:

[Image Placeholder: Sliding window mask (left) and Λ-shaped window mask (right)]

Of course, because it forcibly truncates attention outside the window, this scheme does not satisfy the principle of “not sacrificing long-range dependency,” but we can view it as a baseline. Unfortunately, even with such a sacrifice, this scheme does not work—it can’t even prevent the basic PPL from exploding! In-depth analysis of this phenomenon led to two papers: “LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models” and “Efficient Streaming Language Models with Attention Sinks,” which provided almost identical answers. In fact, several months earlier, an “outsider” discovered the same conclusion and published it in a Zhihu article, “Perpetual Sampling Technical Report.”

The answer might be surprising: The first few tokens are very important and cannot be discarded. Therefore, the final usable Window Mask should be as shown on the right above (the LM-Infinite paper calls it a “\Lambda-Mask”).
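As a minimal illustration (my own sketch, not the code from either paper), such a Λ-shaped mask keeps a handful of initial “sink” tokens visible in addition to the sliding window:

    import torch

    def lambda_mask(seq_len, window, n_sink=4):
        # Boolean mask, True = attention allowed. Causal + sliding window,
        # but the first n_sink tokens remain visible to every query (the "Λ" shape).
        i = torch.arange(seq_len).unsqueeze(1)   # query positions
        j = torch.arange(seq_len).unsqueeze(0)   # key positions
        causal = j <= i
        in_window = (i - j) < window
        keep_sinks = j < n_sink
        return causal & (in_window | keep_sinks)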

Why do the initial tokens occupy such an important position? There are currently two different perspectives of understanding:

1. The initial tokens are “anchors” for absolute position: As the name implies, relative position encodings can theoretically only identify relative positions. However, some tasks may rely on absolute positions. By using the first few tokens (whose absolute position is approximately 0) as “benchmarks,” each token can measure its own absolute position. Removing the initial tokens loses this link, completely disrupting the attention pattern and leading to PPL explosion.

2. The initial tokens are attention “sinks”: Since attention sums to 1, it must be allocated to some tokens. In some cases, the model might find that “no tokens are worth attending to.” In such instances, it chooses to place a portion of the attention on the first few tokens, which contain little information, serving as a way to “not attend.” Removing them forces the model to allocate attention to other irrelevant tokens, thereby disrupting the attention pattern.

In short, empirical evidence shows that in most cases, the attention weight of the first few tokens is very heavy, so they cannot be removed; otherwise, the attention becomes chaotic. As for why they are heavy, that’s up to your imagination.

Position Interpolation

While window truncation can serve as a decent baseline for length extrapolation, and the findings regarding “anchors” or “sinks” further our understanding of how attention mechanisms work, it is not the ultimate solution because it sacrifices long-range dependency by forcibly truncating attention outside the window.

The OOD nature of relative positions manifests directly as relative positions during the prediction stage exceeding the range seen during training. Since they were never trained, the behavior in the “out-of-bounds” part is unpredictable. To this end, a user named “kaiokendev” proposed a very simple solution in his blog “Extending Context to 8K”: “Position Interpolation.” This involves multiplying the position encoding of the long text during prediction by a factor \frac{L_{train}}{L_{test}} to scale it back into the training length range, as shown in the following formula (where positions are relative):

\begin{equation} \begin{aligned} &\text{Training stage}:\,(1,2,\cdots,n-1,n)\\[5pt] &\text{Prediction stage}:\,(1,2,\cdots,n,\underbrace{n+1,\cdots,4n-1,4n}_{\text{Far out-of-bounds}})\xrightarrow{\quad\text{Interpolation}\quad} \big(\underbrace{\frac{1}{4},\frac{2}{4},\frac{3}{4}}_{\text{Local distortion}},\cdots,n-\frac{1}{4},n\big) \end{aligned} \end{equation}

Shortly thereafter, Meta released the same method in the paper “Extending Context Window of Large Language Models via Positional Interpolation,” naming it “Positional Interpolation (PI)” and providing comprehensive experimental results.
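Concretely, PI is a one-line change to the positions fed into RoPE; a minimal sketch under these definitions (names are mine):

    def interpolated_positions(positions, train_len, test_len):
        # Positional Interpolation: scale positions by L_train / L_test so that the
        # largest relative distance at test time falls back inside the trained range.
        scale = train_len / test_len if test_len > train_len else 1.0
        return [p * scale for p in positions]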

However, Position Interpolation is not strictly a length extrapolation scheme—at least not a training-free one—because PPL still explodes after interpolation. The reason is not hard to understand: although interpolation avoids the far-range out-of-bounds problem, it simultaneously compresses the distance between adjacent tokens, severely disrupting the model’s local resolution. Since language modeling is a task heavily dependent on local relationships, disrupting the local structure naturally prevents accurate prediction.

Nevertheless, this doesn’t mean Position Interpolation is valueless. We know that readers who need length extrapolation fall into two categories: one group lacks the resources for long-text fine-tuning and hopes to get a usable long-text model directly from a short-text model; PI is not suitable for them. The other group has the resources for fine-tuning and studies length extrapolation purely to get a better initialization. For them, the initial loss caused by model modification is tolerable as long as the performance can be quickly recovered through fine-tuning. PI belongs to this latter category. Meta’s paper shows that after PI, only about 1,000 steps of long-text training are needed to obtain an effective long-text model, which is much more efficient than fine-tuning without any modifications.

Preserving Near, Compressing Far

Direct extrapolation suffers from far-range out-of-bounds issues, while position interpolation suffers from local distortion. Since they seem complementary, can we combine the strengths of both? This is the idea behind Leaky ReRoPE, proposed in Part 12, and its limit version, ReRoPE.

Based on the previous analysis, it’s easy to infer that the key to training-free length extrapolation is “preserving the near and compressing the far”—that is, “ensuring no local distortion” and “compressing the far range to avoid out-of-bounds.” Leaky ReRoPE achieves this through a very direct approach: it sets a window size w. Within the window, relative positions are unchanged to ensure “no local distortion”; outside the window, position interpolation is used to ensure “no out-of-bounds,” as shown in the following matrix:

\begin{equation} \begin{pmatrix} \color{red}{0} \\ \color{red}{1} & \color{red}{0} \\ \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{red}{w - 1} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{w} & \color{red}{w - 1} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{w + \frac{1}{k}} & \color{green}{w} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{w + \frac{2}{k}} & \color{green}{w + \frac{1}{k}} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{\ddots} & \color{green}{w + \frac{2}{k}} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w + \frac{2}{k}} & \color{green}{w + \frac{1}{k}} & \color{green}{w} & \color{red}{w - 1} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{w + \frac{L-1-w}{k}} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w + \frac{2}{k}} & \color{green}{w + \frac{1}{k}} & \color{green}{w} & \color{red}{w - 1} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \end{pmatrix} \end{equation}

If the interpolation factor k is taken to infinity, we get the simplified ReRoPE. Its position encoding outside the window becomes w, meaning it will never go out of bounds for any sequence length, theoretically possessing infinite extrapolation potential! In fact, both Leaky ReRoPE and ReRoPE perform very well. From a Loss perspective, they achieve almost no loss in performance within the training length while enabling length extrapolation. Furthermore, the longer the context, the lower the Loss, indicating that they indeed preserve long-range dependency while extrapolating.
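The relative-position map itself is trivial to write down; a minimal sketch (function names are mine), with ReRoPE as the k → ∞ limit:

    def leaky_rerope(rel_pos, w, k):
        # Within the window: keep the true relative position (no local distortion).
        # Outside the window: continue with slope 1/k (compressed, so never far out-of-bounds).
        return rel_pos if rel_pos < w else w + (rel_pos - w) / k

    def rerope(rel_pos, w):
        # The k -> infinity limit: every relative position beyond w is clipped to w.
        return min(rel_pos, w)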

The main issue with Leaky ReRoPE and ReRoPE is that their implementation is slightly cumbersome. Unlike Attention Bias-type position encodings, RoPE cannot be implemented by first constructing a relative position matrix and then calculating the encoding (that would be too inefficient). It must be implemented as an absolute position encoding to achieve relative position effects. This means it can only implement linearly increasing relative positions. Since the relative positions in Leaky ReRoPE and ReRoPE are piecewise linear, a naive implementation would require calculating the Attention matrix twice (for two different linear segments) and then splicing them together, which significantly reduces efficiency.

However, the good news is that current mainstream Attention acceleration methods, like Flash Attention, calculate Attention in blocks (e.g., blocks of 128 length). When the sequence is long enough, the piecewise linear blocks account for a very small proportion (only near the window boundary). As shown in the matrix below, only the red-green mixed blocks need repeated Attention calculation; the remaining monochromatic blocks only need to be calculated once. Therefore, when combined with block-based Attention calculation, the additional computational cost of Leaky ReRoPE and ReRoPE is almost negligible. Previously, reader @chu-tianxiang shared a Triton-based implementation in the comments section, which you can refer to.

\begin{equation} \left(\begin{array}{cccc:cccc:cccc} \color{red}{0} \\ \color{red}{1} & \color{red}{0} \\ \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \hdashline \color{red}{w - 1} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{w} & \color{red}{w - 1} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{w + \frac{1}{k}} & \color{green}{w} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{w + \frac{2}{k}} & \color{green}{w + \frac{1}{k}} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \hdashline \color{green}{\ddots} & \color{green}{w + \frac{2}{k}} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w + \frac{2}{k}} & \color{green}{w + \frac{1}{k}} & \color{green}{w} & \color{red}{w - 1} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \\ \color{green}{w + \frac{L-1-w}{k}} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w + \frac{2}{k}} & \color{green}{w + \frac{1}{k}} & \color{green}{w} & \color{red}{w - 1} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} \end{array}\right) \end{equation}

Coincidentally, a paper titled “LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning” was submitted to Arxiv earlier this month. It proposes a training-free length extrapolation method called “Self-Extend,” which is essentially Leaky ReRoPE with an added Round operation (rounding to the nearest integer) to make every relative position an integer, further mitigating the relative position OOD problem. The paper reports excellent results, further confirming the effectiveness of Leaky ReRoPE.

Rotation Perspective

Although Leaky ReRoPE and ReRoPE perform quite well in practice (at least in terms of Loss), they, like position interpolation, directly manipulate position IDs. This feels like “treating the symptoms rather than the cause” and lacks an in-depth analysis of the underlying patterns. For the model, position IDs are not important; position embeddings are what interact directly with the model. To “reach the root of the problem,” we should try to start from the position embeddings.

Some readers might wonder: isn’t there a one-to-one correspondence between position IDs and position embeddings? Isn’t manipulating position IDs equivalent to manipulating position embeddings? While that’s true, their actual behavior is different. For example, position IDs are unbounded, but position embeddings are bounded (RoPE consists of trigonometric functions, which are bounded). Since the model interacts with position embeddings, analyzing from that perspective allows us to clearly understand the specific OOD behavior caused by length extrapolation and thus “prescribe the right medicine.”

In Part 2, when we derived RoPE, we first used complex numbers to derive a 2D solution and then concatenated multiple 2D solutions into a high-dimensional one. Consequently, the dot product of \boldsymbol{q} and \boldsymbol{k} after adding RoPE can be expressed in complex form as:

\begin{equation} (\boldsymbol{\mathcal{R}}_m \boldsymbol{q})^{\top}(\boldsymbol{\mathcal{R}}_n \boldsymbol{k}) = \text{Re}\left[\sum_{i=0}^{d/2-1}\boldsymbol{q}_{[2i:2i+1]}\boldsymbol{k}_{[2i:2i+1]}^* e^{\text{i}(m-n)\theta_i}\right] \end{equation}

where \theta_i defaults to 10000^{-2i/d}, a function that gradually changes from 1 to nearly 0. From Euler’s formula e^{\text{i}t}=\cos t + \text{i}\sin t, we know that e^{\text{i}(m-n)\theta_i} is actually a point on the unit circle. As m-n increases, this point rotates around the unit circle (true rotation). The larger \theta_i, the faster it rotates; the smaller \theta_i, the slower.
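This identity is easy to verify numerically. The following small numpy sketch (my own check, not part of the original derivation) applies the 2D rotations as complex multiplications and confirms that the score depends on the positions only through e^{\text{i}(m-n)\theta_i}:

    import numpy as np

    def rope_complex(x, pos, theta):
        # View consecutive pairs (x[2i], x[2i+1]) as complex numbers and rotate
        # each pair by the angle pos * theta_i on the unit circle.
        xc = x[0::2] + 1j * x[1::2]
        return xc * np.exp(1j * pos * theta)

    d = 64
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)    # theta_i, decaying from 1 toward 0
    q, k = np.random.randn(d), np.random.randn(d)
    m, n = 100, 37

    lhs = np.sum(rope_complex(q, m, theta) * np.conj(rope_complex(k, n, theta))).real
    rhs = np.sum((q[0::2] + 1j * q[1::2]) * np.conj(k[0::2] + 1j * k[1::2])
                 * np.exp(1j * (m - n) * theta)).real
    assert np.allclose(lhs, rhs)    # only m - n matters, exactly as in the formula above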

[Image Placeholder: Rotating less than one full circle]

Assume the training length is L_{train}, so m-n\in[0, L_{train}-1]. Now let’s use our imagination: a larger \theta_i means a faster rotation speed and a shorter period. Thus, during the interval where m-n goes from 0 to L_{train}-1, it has already rotated many times, meaning almost every point on the circle has been trained. Therefore, these \theta_i values have almost no OOD problem. Conversely, for smaller \theta_i, when m-n goes from 0 to L_{train}-1, it might not have completed even one rotation. In this case, the trained points are at most an arc on the circle. If a larger L_{test} is encountered during testing, it exceeds the range of the trained arc, leading to unpredictable behavior. At this point, interpolation is needed to compress it back into the original arc. Simply put, whether the position ID m-n is OOD is not important; what matters is whether the points on the unit circle have been sufficiently trained. If they have, no change is needed (direct extrapolation); otherwise, we must find a way to compress them back onto the arc that has been sufficiently trained (position interpolation).

Specifically, for \theta_i, we can calculate the period T_i=2\pi/\theta_i and then the number of “rotations” during training as r_i=\frac{L_{train}}{T_i}=\frac{\theta_i L_{train}}{2\pi}. We can set a rotation threshold \tau. If the number of rotations exceeds \tau, we consider it sufficiently trained and leave it unchanged. If the number of rotations is less than 1, \theta_i is changed to \frac{\theta_i L_{train}}{L_{test}}, meaning the range exceeding the arc is scaled back. For the remaining part, we use linear interpolation to transition between the two. Expressed as a formula:

\begin{equation} \theta_i^{new} = \left[\gamma_i + (1 - \gamma_i)\frac{L_{train}}{L_{test}}\right]\theta_i,\quad \gamma_i = \left\{\begin{aligned}&1,&r_i > \tau \\ &0,&r_i < 1 \\ &\frac{r_i - 1}{\tau - 1},&\text{others} \end{aligned}\right. \end{equation}

This is the training-free length extrapolation scheme “YaRN” proposed in “YaRN: Efficient Context Window Extension of Large Language Models.” In my tests, its extrapolation effect is excellent, only slightly inferior to Leaky ReRoPE and ReRoPE. Notably, YaRN only changes the value of \theta_i and does not change the form of Attention or RoPE, so it incurs no additional implementation or inference cost. Among methods that can be plugged directly into existing implementations, YaRN is the best length extrapolation method I have tested.
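A minimal sketch of this frequency correction (my own rendering of the formula above, not the official YaRN implementation; the rotation threshold \tau here is only illustrative):

    import numpy as np

    def yarn_theta(d, train_len, test_len, base=10000.0, tau=32):
        theta = base ** (-2 * np.arange(d // 2) / d)
        r = train_len * theta / (2 * np.pi)              # r_i: rotations completed during training
        gamma = np.clip((r - 1) / (tau - 1), 0.0, 1.0)   # 1 = trained enough, 0 = interpolate fully
        return (gamma + (1 - gamma) * train_len / test_len) * theta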

Some Interludes

The story of YaRN doesn’t end there. In addition to the modification of \theta_i, YaRN introduces an extra Scale factor to the Attention Logits:

\begin{equation} \lambda = \left(1 + 0.1 \log \frac{L_{test}}{L_{train}}\right)^2 \approx 1 + 0.2 \log \frac{L_{test}}{L_{train}} \label{eq:scale-yarn} \end{equation}

The derivation of this Scale might be a bit amusing: there is no derivation. The author stated that he couldn’t derive it theoretically; it was purely an experimental discovery that adding this Scale resulted in a lower PPL, and the form was fitted through experiments.

Actually, this logarithmic result is very similar to the \log n Scale derived in “Attention Scale Operation from the Perspective of Entropy Invariance.” The difference is that the latter depends on the specific position, while the former is a constant once L_{test} is determined. Considering that the \log n function changes slowly when n is large, treating it as a constant within a certain range is justifiable. Therefore, it’s easy to guess that YaRN’s Scale factor shares the same origin as the entropy invariance \log n Scale. I have also done a comparison: replacing the constant \lambda with the following factor related to absolute position n yields similar results:

\begin{equation} \lambda_n = \max\left(1, \frac{\log n}{\log L_{train}}\right) \label{eq:clip-logn} \end{equation}

Note that: \begin{equation} \frac{\log L_{test} }{\log L_{train}} = 1 + \frac{1}{\log L_{train}} \log\left(\frac{L_{test}}{L_{train}}\right) \end{equation}

YaRN’s experiments were based on LLaMA and LLaMA2. The former has a training length of 2K, and the latter 4K. We have \frac{1}{\log 2048}\approx 0.13 and \frac{1}{\log 4096}\approx 0.12. The coefficients are roughly half of those in Eq. [eq:scale-yarn], which is not a huge difference. In fact, the exact value of this coefficient might not be important, as I have found datasets where Eq. [eq:clip-logn] performs better. Thus, we can consider Eq. [eq:scale-yarn] to be approximately derived.
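As a quick numerical check (my own arithmetic, using natural logarithms), for a LLaMA2-like setting with L_train = 4096 and L_test = 16384 the two scale factors come out comparable:

    import math

    L_train, L_test = 4096, 16384
    lam = (1 + 0.1 * math.log(L_test / L_train)) ** 2    # YaRN's fitted scale, about 1.30
    lam_n = math.log(L_test) / math.log(L_train)         # log-n style scale at n = L_test, about 1.17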

Compared to YaRN itself, the story of its author, Bowen Peng, is perhaps even more “fascinating.” He previously proposed NTK-RoPE, the first training-free length extrapolation scheme for RoPE. Parts 10 and 11 of this series were directly inspired by it. Although NTK-RoPE’s performance might not be the best today (compared to YaRN or ReRoPE), it was the first to demonstrate the possibility of training-free length extrapolation, making it a milestone. One could even say that all subsequent research on length extrapolation has directly or indirectly benefited from the imagination sparked by NTK-RoPE.

The idea of NTK-RoPE is simple: just change the base of RoPE. Originally \theta_i = 10000^{-2i/d}, it becomes \theta_i = (10000\kappa)^{-2i/d}. How to choose \kappa? Based on his experience with NTK (Neural Tangent Kernel) results, Bowen Peng judged that high frequencies (i\to 0) learn relative distances and should not be changed, while low frequencies (i\to d/2-1) learn absolute distances and should be interpolated. This is summarized as “high-frequency extrapolation, low-frequency interpolation.” By setting the scale at i = d/2-1 to be exactly equal to the interpolation scale \frac{L_{train}}{L_{test}}, he obtained the equation:

\begin{equation} (10000\kappa)^{-2i/d}|_{i=d/2-1} = \left.\frac{L_{train}}{L_{test}}10000^{-2i/d}\right|_{i=d/2-1} \end{equation}

Solving for \kappa: \begin{equation} \kappa = \left(\frac{L_{test}}{L_{train}}\right)^{d/(d-2)} \label{eq:kappa} \end{equation}

This simple and clever derivation opened the “Pandora’s box” of training-free length extrapolation.
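In code, NTK-RoPE amounts to nothing more than a change of base; a minimal sketch using Eq. [eq:kappa] (names are mine):

    import numpy as np

    def ntk_rope_theta(d, train_len, test_len, base=10000.0):
        # i = 0 (highest frequency) stays untouched, while i = d/2 - 1 (lowest frequency)
        # ends up scaled by exactly L_train / L_test, i.e. "high-frequency extrapolation,
        # low-frequency interpolation".
        kappa = (test_len / train_len) ** (d / (d - 2))   # Eq. (kappa)
        return (base * kappa) ** (-2 * np.arange(d // 2) / d)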

From the perspective of YaRN, it’s not just the \theta_i at i = d/2-1 that rotates less than one full circle. Thus, NTK-RoPE’s approach of only performing full interpolation for the last i = d/2-1 is insufficient. Indeed, setting \kappa as in Eq. [eq:kappa] only allows the model to extrapolate to about L_{test}/2 without PPL explosion; beyond that, PPL rises significantly. Because of this issue, the author proposed the upgraded YaRN.

However, although NTK-RoPE is inferior to YaRN in performance, the second group of readers (those with resources for fine-tuning) might prefer NTK-RoPE. Since they only need a better initialization and will fine-tune anyway, they don’t care much about the initial performance difference between NTK-RoPE and YaRN. They prefer the simpler implementation of NTK-RoPE. For example, CodeLLAMA was trained by changing the base to 10^6 on top of LLaMA2. Furthermore, Meta, in its paper “Effective Long-Context Scaling of Foundation Models,” renamed NTK-RoPE to RoPE-ABF (Adjusted Base Frequency), which is more intuitive than the mysterious “NTK.”

Refusing to Pay the Tax

You might have noticed that the training-free length extrapolation methods mentioned above cannot keep the model’s performance within the training length L_{train} unchanged. Specifically, if the original model is f(x) and the modified model is f^+(x), it cannot be guaranteed that f(x)\equiv f^+(x) when the length of x does not exceed L_{train}. Since f(x) was trained on L_{train}, it is reasonable to assume f(x) is optimal for samples within that length. Thus, f^+(x)\neq f(x) means that while length extrapolation improves performance on longer samples, it degrades performance on the original L_{train} range. We can figuratively call this loss the “extrapolation tax.”

As early as when NTK-RoPE was first proposed, the open-source community recognized the “extrapolation tax” and proposed a solution: dynamically adjusting the scale factors of various extrapolation methods as the sequence length changes. This is Dynamic Scaling, first proposed in a Reddit post “Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning.” Taking YaRN as an example, where the length-related scaling factor is s=\frac{L_{test}}{L_{train}}, Dynamic Scaling replaces it with a dynamic s(pos)=\frac{\max(L_{train}, pos+1)}{L_{train}}, where pos is the current token’s position index (starting from zero). This change means Dynamic Scaling attempts to find the smallest scale factor for each position that theoretically has the least impact on model performance (or equivalently, each position gets a different \theta_i(pos)), thereby refusing to pay the tax.

However, it is difficult to truly implement a different \theta_i(pos) for every position. For the same reason that Leaky ReRoPE and ReRoPE require repeated Attention calculations: RoPE achieves relative positions through absolute positions, meaning a single calculation can only implement one fixed \theta_i. To have different \theta_i for different positions, the K in the KV Cache must be stored before applying RoPE, and different positions must be calculated multiple times, turning it into a recursive process similar to an RNN. We know that LLM response involves two stages: prefill and generation. Prefill refers to the calculation of the input part, and generation is the token-by-token stage. Obviously, the prefill stage is originally parallelizable. If it were changed to be recursive like generation, it would significantly slow down computation when the input is long (e.g., inputting a whole paper), making it impractical.

Thus, a compromise is “local static”: we can calculate how many tokens are in the prefill stage and set a max_gen_tokens for the generation stage. We add these two numbers to use as the L_{test} for the current conversation to calculate the corresponding \theta_i. After completing this conversation, we update L_{test} and \theta_i for the next one in the same way. This way, we don’t introduce overly complex or efficiency-sacrificing implementations. It’s a practical solution, especially since when the input is long, max_gen_tokens is much smaller than the prefill tokens, so the Scale is approximately constant within a single conversation.
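A minimal sketch of this “local static” compromise, reusing the yarn_theta sketch from earlier (the names and per-conversation policy details are mine):

    def theta_for_conversation(d, train_len, prefill_tokens, max_gen_tokens):
        # Estimate how long this conversation can get and derive the scale from that,
        # instead of using one worst-case L_test for every request. (The fully dynamic
        # version would instead use s(pos) = max(L_train, pos + 1) / L_train per position.)
        test_len = max(train_len, prefill_tokens + max_gen_tokens)
        return yarn_theta(d, train_len, test_len)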

The idea of Dynamic Scaling was taken to the extreme by CLEX, proposed in “CLEX: Continuous Length Extrapolation for Large Language Models.” CLEX likewise assigns a different \theta_i(pos) to each position, assumes \theta_i(pos) is a continuous function of pos, and models it with a Neural ODE. By fine-tuning the parameters of this ODE, it achieves better results than YaRN. The experiments also show that continually applying Dynamic Scaling yields nearly unlimited length extrapolation capability.

Starting Anew

Besides Dynamic Scaling, another approach to “refusing to pay the tax” is “starting anew”—redesigning the model architecture used during pre-training so that it has the potential for length extrapolation without any modification after training. In this series, I have explored two related ideas: HWFA (Hybrid Window-Full Attention) in Part 9 and Key Norm in Part 15.

In HWFA, the first L-1 layers of Attention are replaced with RoPE + Window Attention with a small window, while the final layer is replaced with NoPE + Full Attention. Models trained this way have some length extrapolation effect without modification. A similar idea is found in “Focused Transformer: Contrastive Training for Context Scaling,” though that paper aims to extend context length through simple fine-tuning rather than extrapolation. The problem with HWFA is that its training performance is inferior to standard Attention models. To address this, I proposed an improved version, HWFA2 (HWFA + ReRoPE), in Part 14.

Compared to HWFA, HWFA2 uses a larger Window Size for Window Attention, restores RoPE for Full Attention, and allows more than one layer of Full Attention to be interspersed among Window Attention layers (rather than just one at the end). These modifications can match the training performance of standard Attention (sometimes even exceeding it), but the downside is that it cannot achieve length extrapolation without modification (RoPE must be replaced with ReRoPE). This is a trade-off. Of course, we can ignore the extrapolation effect and simply view HWFA2 as an acceleration scheme that reduces model complexity without losing performance. Incidentally, an Arxiv paper last month, “Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention,” proposed a method called Zebra, which, like HWFA2, uses a combination of several Full Attention layers interspersed with Window Attention.
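To make the two layouts concrete, here is a hypothetical sketch of which attention type each layer might use under HWFA and HWFA2 (the counts, spacing, and names are illustrative, not the original configurations):

    def hwfa_layout(num_layers):
        # HWFA: RoPE + small-window attention everywhere, NoPE + full attention only in the last layer.
        return ["rope_window"] * (num_layers - 1) + ["nope_full"]

    def hwfa2_layout(num_layers, full_every=6):
        # HWFA2: larger windows, and full-attention layers (RoPE in training, ReRoPE at inference)
        # interspersed periodically among the window-attention layers.
        return ["full" if (i + 1) % full_every == 0 else "rope_window" for i in range(num_layers)]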

As for Key Norm, it originated from the “accidental discovery” that length extrapolation capability significantly improved after applying L2 normalization to the Attention Keys. Further reflection on this deepened my understanding of length extrapolation. For standard Attention based on Q and K dot products, we can express it as:

\begin{equation} s(n|m) = \boldsymbol{q}_m\cdot \boldsymbol{k}_n = \Vert\boldsymbol{q}_m\Vert \Vert\boldsymbol{k}_n\Vert \cos(\boldsymbol{q}_m,\boldsymbol{k}_n),\quad p(n|m) = \frac{\exp\left(\frac{s(n|m)}{\sqrt{d}}\right)}{\sum\limits_{j=1}^m \exp\left(\frac{s(j|m)}{\sqrt{d}}\right)} \end{equation}

Clearly, to increase the relative attention paid to n for a given m, the model has two choices: increase \Vert\boldsymbol{k}_n\Vert or increase \cos(\boldsymbol{q}_m,\boldsymbol{k}_n). Due to the curse of dimensionality, increasing \Vert\boldsymbol{k}_n\Vert is much easier than increasing \cos(\boldsymbol{q}_m,\boldsymbol{k}_n), so the model will choose the former whenever possible. Since \Vert\boldsymbol{k}_n\Vert is independent of m, it describes the absolute importance of token n; this might be one of the causes of the attention distribution characteristics described in Scissorhands. On the other hand, if the model tends to increase \Vert\boldsymbol{k}_n\Vert, the training of \cos(\boldsymbol{q}_m,\boldsymbol{k}_n) may be insufficient, which is likely the more fundamental reason why Attention cannot extrapolate.

Thus, the reason Key Norm improves extrapolation becomes clear. Key Norm normalizes all \Vert\boldsymbol{k}_n\Vert to 1, leaving the model with no choice but to focus on adjusting \cos(\boldsymbol{q}_m,\boldsymbol{k}_n), making its training more thorough. I have also conducted comparative experiments showing that Key Norm only demonstrates extrapolation capability when paired with RoPE. Key Norm + NoPE or pure NoPE shows no extrapolation effect. This is likely because RoPE’s rotation increases the diversity of the angles between \boldsymbol{q}_m and \boldsymbol{k}_n (acting like data augmentation), thereby making the training of \cos(\boldsymbol{q}_m,\boldsymbol{k}_n) more robust.
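As an illustration, Key Norm itself is a one-line change applied to the keys before the dot product (a sketch, not the exact implementation from Part 15):

    import torch

    def key_norm(k, eps=1e-6):
        # L2-normalize each key vector so that ||k_n|| = 1; the score
        # q_m . k_n = ||q_m|| cos(q_m, k_n) can then only grow via the cosine term.
        return k / (k.norm(dim=-1, keepdim=True) + eps)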

Another paper, “CoCA: Fusing position embedding with Collinear Constrained Attention for fine-tuning free context window extending,” proposed a solution from a different angle: it modifies the implementation of attention so that for each pair \boldsymbol{q}_m^{(i)},\boldsymbol{k}_m^{(i)}, we have \cos(\boldsymbol{q}_m^{(i)},\boldsymbol{k}_m^{(i)})=1. Here, the grouping i refers to the pairwise grouping of \boldsymbol{q},\boldsymbol{k} components in RoPE. This design ensures that the large values of \cos(\boldsymbol{q}_m,\boldsymbol{k}_n) are mostly covered by training (since the maximum of \cos is 1); the insufficiently trained parts are only the small values, which have low Softmax probabilities and do not significantly disturb the attention distribution, and the model thereby gains some extrapolation capability. However, CoCA’s modification of attention risks lowering the upper bound of each attention head’s capability: with the same number of parameters, it might only have the fitting capacity of a standard attention head with head_size/2.

Other Ideas

We are nearing the end of this introduction to length extrapolation. Despite the length of this article, it is still difficult to introduce all work in detail. Below are some other related works I can recall.

Initially, we thought Attention couldn’t extrapolate because of position “out-of-bounds” during prediction. A naive solution is to perturb the position encodings during training—a form of data augmentation—to let the model adapt to the position encodings used in prediction. Part 8 and Part 13 of this series fall into this category, as does “PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training” from a few months ago. These methods were not very stable in my experiments and introduced extra complexity or randomness, making it hard to guarantee they wouldn’t affect the model’s original Scaling Law.

Some readers have asked: in the YaRN analysis, it’s the low-frequency part that needs interpolation. What if we just remove the low-frequency part? Or similarly, decrease the base to increase the proportion of high-frequency parts? I did try decreasing the RoPE base during pre-training, and the result was that the final performance was worse and showed no extrapolation capability. However, “Scaling Laws of RoPE-based Extrapolation” (there is a Chinese version on Zhihu) tried another approach: decreasing the Base only during the fine-tuning stage. Combined with short-text fine-tuning, this can demonstrate long-text extrapolation capability.

From my perspective, decreasing the Base or removing low frequencies is not scientific. Even if it might have an extrapolation effect in some cases, it might sacrifice the model’s inherent capabilities. As Bowen Peng once argued, high frequencies learn local relative distances and low frequencies learn long-range absolute distances; both are important and exist in a hierarchical relationship. From the perspective of Part 10 (RoPE as a \beta-ary encoding), low frequencies correspond to high-order digits. If you only keep low-order digits and discard high-order ones, it’s like taking a modulo (remainder), which cannot accurately express position information. Moreover, high and low frequencies are relative; a frequency that is low for a 10K text might be high for a 100K text.

A recent interesting paper, “Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use,” found that averaging the outputs of the same model with different bases can enhance overall performance. This suggests that different bases have different strengths and shouldn’t be decreased solely for extrapolation.

Overall, while length extrapolation technology has made great strides, it remains a mysterious subject. For example, switching RoPE to ReRoPE at inference time yields some extrapolation effect, so would using ReRoPE during pre-training give even better results? Quite the opposite: in my experiments, a model trained with ReRoPE had zero extrapolation capability. This likely relates to the Key Norm analysis: using ReRoPE during training reduces the diversity of angles between \boldsymbol{q}_m and \boldsymbol{k}_n, making the training of \cos(\boldsymbol{q}_m,\boldsymbol{k}_n) less thorough and thereby reducing extrapolation capability. Many extrapolation techniques may also be tied to the architecture. Some early position encodings that claimed extrapolation capability (ALiBi, KERPLE, XPOS, etc.) were tested on Multi-Head Attention + Pre Norm; in my tests with Single-Head GAU + Post Norm, I never observed any extrapolation from them. This suggests that our analysis of length extrapolation may still be missing an architecture-related piece of the puzzle.

Summary

In this article, I have combined my learning experiences to review the progress in length extrapolation over the past year. I have briefly introduced the characteristics and underlying ideas of relevant methods and attempted to connect them. I hope this article helps everyone gain a deeper and more systematic understanding of the subject of length extrapolation. If there are any errors or omissions, please feel free to point them out.