English (unofficial) translations of posts at kexue.fm
Source

The Road to Transformer Upgrade: 9. A New Idea for Global Length Extrapolation

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate, please refer to the original post for important stuff.

When discussing why Transformers cannot handle ultra-long sequences, the first reaction is usually the quadratic complexity of Self-Attention. However, in reality, even if computational limits are ignored, conventional Transformers still cannot handle ultra-long sequences because their Length Extrapolation is poor. This is specifically manifested as a significant drop in model performance when the input sequence length significantly exceeds the training length.

Although there has been some related work, length extrapolation remains far from practically solved. This article introduces a reference scheme conceived by the author, which may currently be the only length extrapolation method for generative models that retains global dependency capabilities.

Method Review

Length extrapolation, also known as length generalization, has been partially introduced in previous posts: "The Road to Transformer Upgrade: 7. Length Extrapolation and Local Attention" and "The Road to Transformer Upgrade: 8. Length Extrapolation and Position Robustness". However, they each have their own issues.

The various schemes introduced in the first article all follow the idea of localizing attention. Although they improve the metrics, the gains are essentially just slightly better numbers: these schemes cannot extrapolate global dependencies, and therefore offer no substantial help in scenarios that truly require long-range dependencies (such as In-Context Learning). The second article enhances robustness to position signals through random position perturbations, which could in principle preserve global dependencies, but that method is only suitable for Encoder models, not for autoregressive generative models like GPT.

Therefore, length extrapolation remains an urgent but unsolved problem for Transformers. In fact, the problem is not unique to Transformers: the linear RNN models (including the popular RWKV) introduced in "Google's New Work Attempts to 'Revive' RNNs: Can RNNs Shine Again?" also extrapolate poorly in length. In the current LLM era, length extrapolation capability is particularly important, because we always hope models can handle arbitrarily long text, yet training samples cannot be made arbitrarily long.

Translation Invariance

Next, we will introduce the method focusing on autoregressive Transformers, though it is also effective for bidirectional attention Encoders. Essentially, localized attention grants the entire model "translation invariance" by limiting the perception range of attention. A simple baseline for translation invariance is Window Attention, as shown below:

[Image: Window Attention]

[Image: Receptive Field Stacking Diagram]

Suppose the model stacks L layers of Window Attention, each with window size w. The maximum receptive field of a token in the last layer is then (w-1)L+1. Assuming the training length is N, under the constraint (w-1)L+1 = \alpha N \, (0 < \alpha \leq 1) the model obtains a certain degree of translation invariance: its maximum receptive field never exceeds N, so every receptive field the model uses is sufficiently trained. Generally, the smaller \alpha is, the better the translation invariance.
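Concretely, causal Window Attention restricts each query to its last w keys, and stacking multiplies the receptive field. A minimal NumPy sketch (an assumed illustration, not the author's code):

```python
# Minimal sketch of causal Window Attention's visibility pattern and the
# receptive-field constraint (w - 1) * L + 1 <= alpha * N.
import numpy as np

def window_attention_mask(seq_len: int, w: int) -> np.ndarray:
    """True where query i may attend key j: causal, and within the last w tokens."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & ((i - j) < w)

def receptive_field(w: int, num_layers: int) -> int:
    """Maximum receptive field after stacking num_layers window attentions."""
    return (w - 1) * num_layers + 1

# Numbers matching the experiments later in the post: 23 window layers with
# w = 16 and training length N = 512 give (16-1)*23 + 1 = 346 <= 0.75*512 = 384.
mask = window_attention_mask(8, 4)
rf = receptive_field(16, 23)
```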

However, while this ensures translation invariance, it brings other problems. The most serious is that because the receptive field of each layer is limited to w, the capability of the attention mechanism is greatly weakened, resulting in training effects inferior to regular attention (hereinafter referred to as Full Attention). Furthermore, our expectation for length extrapolation is not just "translation invariance," but "translation improvement," meaning the performance should get better as the sequence progresses (e.g., in In-Context Learning scenarios, the more examples provided, the better the performance should be). Therefore, the model should also be able to capture global dependencies.

Global Dependency

To this end, the author reasoned as follows. The features produced by Window Attention are essentially n-gram features, except that n grows under multi-layer stacking. A single layer of Full Attention, meanwhile, can be viewed as a kind of "retrieval" (as suggested by the terms query, key, and value) plus "fusion", and its behavior is relatively easy to analyze. Previously, in "Looking at the Scale Operation of Attention from Entropy Invariance", we concluded that a single layer of (Full) Attention can improve its length extrapolation by adding a \log n scaling factor to the attention logits.
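As a rough illustration of that \log n factor (the exact normalization in the entropy-invariance post differs; this sketch simply multiplies the standard scaled logits of query i by \log n, where n is the number of keys visible to that query):

```python
# Assumed sketch of causal single-head attention with a per-query log(n)
# logit scale. The log n factor aims to keep attention entropy roughly
# stable as the context grows beyond the training length.
import numpy as np

def logn_attention(q, k, v):
    """q, k, v: (seq_len, d) arrays; returns (seq_len, d) outputs."""
    seq_len, d = q.shape
    logits = (q @ k.T) / np.sqrt(d)
    # n for query i is its number of visible keys, i + 1.
    n = np.arange(1, seq_len + 1, dtype=np.float64)
    logits = logits * np.log(n)[:, None]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    logits = np.where(causal, logits, -np.inf)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```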

Therefore, the author conceived an idea:

If the first L-1 layers obtain n-gram features through Window Attention, can the last layer be replaced with Full Attention with a \log n factor to retrieve and integrate these features, thereby making up for the performance gap and gaining global dependency capabilities?

To this end, we propose the following attention combination method (Hybrid Window-Full Attention, referred to as HWFA):

  1. The first L-1 layers use "Window Attention + RoPE" with window size w, satisfying the constraint (w-1)(L-1)+1 = \alpha N, where N is the training length. To balance training and extrapolation effects, it is recommended to choose the largest possible w under the premise of \alpha \leq 3/4.

  2. The L-th layer uses Full Attention with a \log n factor but does not use RoPE.

The reason for using RoPE in the earlier layers is that many experimental results have shown that RoPE helps enhance model performance (at least for base and large-scale models). The reason for not using RoPE in the last layer is that RoPE beyond the training length has not been trained, which would affect length extrapolation. In fact, the RoPE in the first L-1 layers is sufficient to supplement position information for the model; not adding RoPE to the last layer basically does not affect the model’s training performance.
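The recipe above can be summarized as a per-layer configuration. The following is a hypothetical sketch; the class and field names are illustrative, not from the author's implementation:

```python
# Hypothetical configuration sketch of the HWFA layer schedule:
# layers 1..L-1 use Window Attention + RoPE, layer L uses Full Attention
# with a log n logit scale and no RoPE.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AttnSpec:
    kind: str              # "window" or "full"
    window: Optional[int]  # window size for "window" layers, None otherwise
    rope: bool             # apply RoPE in this layer?
    logn_scale: bool       # multiply attention logits by log n?

def hwfa_schedule(num_layers: int, w: int) -> List[AttnSpec]:
    """First num_layers - 1 layers: Window Attention + RoPE; last: Full + log n."""
    specs = [AttnSpec("window", w, rope=True, logn_scale=False)
             for _ in range(num_layers - 1)]
    specs.append(AttnSpec("full", None, rope=False, logn_scale=True))
    return specs

schedule = hwfa_schedule(24, 16)  # the post's 24-layer, w = 16 setting
```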

Experimental Results

Clearly, HWFA is a combination of attention types. It can be used in standard Multi-Head Attention or in attention variants like GAU. The author conducted experiments based on GAU_alpha: training length 512, 24 layers of GAU, the first 23 layers using Window Attention with window size w=16. The test metric is per-token accuracy, and the baseline is all layers using Full Attention + RoPE (the conventional default usage).

The results are very encouraging:

Test Length    512       4096
Baseline       49.41%    24.17%
HWFA           48.70%    80.84%

Here 512 represents the training-length accuracy (also called interpolation accuracy), and 4096 the extrapolation accuracy. Why is the training accuracy only in the 40s while the extrapolation reaches a staggering 80+? This is because, when constructing the test samples, the author included some repeated-concatenation samples: a segment of text shorter than 4096 was repeatedly concatenated until the total length reached 4096. Since the latter parts of such a sample repeat its earlier parts, accuracy on those parts is very high (the answer has effectively already been shown to the model). This indicates that, as hoped, the design achieves length extrapolation without sacrificing global dependency capability.
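The repeated-concatenation construction could be sketched as follows (an assumed reconstruction of the sample-building step, not the author's code):

```python
# Assumed reconstruction: repeat a short token sequence until the target
# evaluation length is reached, truncating any overhang.
def repeat_to_length(tokens, target_len):
    reps = -(-target_len // len(tokens))  # ceiling division
    return (tokens * reps)[:target_len]

sample = repeat_to_length([7, 8, 9], 8)  # -> [7, 8, 9, 7, 8, 9, 7, 8]
```

Accuracy on the repeated tail of such a sample measures whether the model can retrieve content seen thousands of tokens earlier, i.e. a genuinely global dependency.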

If repeated samples are removed and only normal natural text samples are kept, the results are still respectable:

Test Length    512       4096
Baseline       49.41%    23.16%
HWFA           48.70%    48.15%

To further verify global dependency capabilities, the author also performed the "even pairs" task (determining if the first and last characters are the same) from "The Road to Transformer Upgrade: 8. Length Extrapolation and Position Robustness". The method in this article achieved 100% extrapolation accuracy, which also indicates that the model can learn global dependencies (attention needs to span the entire sequence to accurately judge if they are the same).
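An "even pairs" example generator might look like this (an assumed data format; the original post does not give the generation code):

```python
# Sketch of an "even pairs" example generator: the label is whether the
# first and last characters match, so a model must attend across the
# entire sequence to answer correctly.
import random

def make_even_pairs_example(length: int, rng: random.Random):
    s = "".join(rng.choice("ab") for _ in range(length))
    return s, int(s[0] == s[-1])

s, label = make_even_pairs_example(4096, random.Random(0))
```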

The author also conducted some ablation experiments, with the following results:

  1. If Window Attention does not use RoPE, both interpolation and extrapolation performance decrease.

  2. If Full Attention uses RoPE, extrapolation performance decreases.

  3. If Full Attention does not use the \log n factor, extrapolation performance decreases.

  4. If all layers use Window Attention, both interpolation and extrapolation performance decrease.

  5. If changed to L-2 layers of Window Attention + 2 layers of Full Attention, extrapolation performance decreases.

  6. If w=32 (in which case (w-1)(L-1) > N), extrapolation performance decreases.

Comparative Analysis

Some readers might ask: why is there no comparison with other methods? The reason may be unexpected: when the author tried some of the methods from "The Road to Transformer Upgrade: 7. Length Extrapolation and Local Attention" on GAU, they all failed (their extrapolation was very poor)!

Why is this? The author's first reaction was that those works experimented with standard Multi-Head Attention, whereas the author experimented with GAU. As an attention mechanism, GAU's most distinctive feature is that it is single-headed (unlike the original GAU, the variant the author experimented with is also Softmax-normalized). So the author initially attributed the gap to the multi-head versus single-head difference: schemes like ALiBi, Sandwich, and XPOS have parameter designs tailored to multiple heads, and their effectiveness in the single-head setting indeed remains to be verified.

However, after further verification, the author found that the difference between single-head and multi-head does not affect length extrapolation as much as imagined, indicating there must be another reason. It wasn’t until a few days ago that the author realized another important difference: the author has always used the Post-Norm architecture, while mainstream work uses Pre-Norm. In "Why is Pre-Norm Not as Effective as Post-Norm?", we analyzed that the depth of Pre-Norm is actually somewhat "diluted." Therefore, when local constraints are applied to every attention layer, the features output by Pre-Norm are actually more localized, resulting in better extrapolation.

So, judging from the current results, if the author insists on the GAU + Post-Norm combination, the method in this article seems to be the only workable route to length extrapolation. Its extrapolation is underpinned by "translation invariance" plus "independent and identically distributed" (i.i.d.) features: the Window Attention in the first L-1 layers, whose total receptive field does not exceed the training length, provides translation invariance and therefore yields a sequence of approximately i.i.d. features; the Full Attention in the last layer then takes a weighted average of these features, and from a statistical perspective, the average of i.i.d. variables extrapolates stably.
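A quick numerical illustration of that last statistical point (not from the original post): the mean of i.i.d. samples stays near the true mean no matter how many samples are averaged, which is why averaging i.i.d. features should remain stable at lengths beyond those seen in training.

```python
# Illustration only: sample means of i.i.d. variables remain stable as the
# number of averaged variables (the "sequence length") grows.
import numpy as np

rng = np.random.default_rng(0)
means = [rng.normal(size=n).mean() for n in (512, 4096, 32768)]
# Each mean stays close to the true mean 0, independent of n.
```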

Additionally, the author has already attempted to compare HWFA with other works under standard Multi-Head Attention and will update everyone when there are further results.

Further Thoughts

From the author's experimental results, HWFA's training performance is slightly worse than the Baseline's. A natural concern, then, is whether this gap widens as the model scales up, or whether, at tens or hundreds of billions of parameters, such a design would show the same emergent capabilities as the standard design. This is the Scaling Law question many people ask about architectural modifications in the LLM era. Admittedly, there is no definitive answer until HWFA is actually scaled to the hundred-billion-parameter level, but the author's initial guess is that a performance bottleneck may exist.

Of course, HWFA can currently only be considered a baseline for length extrapolation. Its main purpose is to achieve length extrapolation while retaining global dependency capabilities, and preliminary results suggest it can. The next step is to close the training-performance gap with the Baseline while keeping those capabilities. Additionally, HWFA captures global dependencies only in the final Full Attention layer, which likely creates a performance bottleneck; yet using more Full Attention layers degrades length extrapolation, an issue in urgent need of optimization.

It is worth mentioning that since the Window Attention in the first L-1 layers has only a limited receptive field, it is theoretically possible to replace them with models like CNNs, as long as the total receptive field does not exceed the training length N. Therefore, attempting to combine the thinking of HWFA with other basic architectures is also a direction worth considering.

Summary

This article introduces a length extrapolation scheme conceived by the author. By combining Window Attention and Full Attention, it forms length extrapolation capability while retaining global dependency capability. It should be the only length extrapolation method currently applicable to generative models that possesses global dependency capabilities.

When reposting, please include the original address of this article: https://kexue.fm/archives/9603

For more detailed reposting matters, please refer to: "Scientific Space FAQ"