In the previous article "Transformer Upgrade Road: 13. Reverse Leaky ReRoPE", I attempted to use the idea of reversing Leaky ReRoPE during the training phase so that the position encoding during the inference phase becomes normal RoPE. This was intended to achieve length extrapolation while solving the disadvantage of slow inference in ReRoPE. Unfortunately, experimental results showed that the effect of "Leaky ReRoPE \to RoPE" was not as good as "RoPE \to ReRoPE/Leaky ReRoPE," so this problem has not been fully resolved.
At this point, I recalled that HWFA, which I previously proposed in "Transformer Upgrade Road: 9. A New Idea for Global Length Extrapolation", inherently possesses a certain degree of length extrapolation capability. If combined with ReRoPE in a "powerful alliance," would it yield better results? More importantly, the addition of HWFA can significantly reduce inference costs, thereby making up for the shortcomings of ReRoPE!
Review
First, as usual, let's review HWFA. HWFA (Hybrid Window-Full Attention) is not a specific model but rather a combination of Attention mechanisms: it can enhance the length extrapolation capability of Attention models while maintaining performance and reducing both training and inference costs.
Specifically, HWFA consists of "L-1 layers of Window RoPE Attention + 1 layer of Full NoPE Attention." That is, the first L-1 layers of Attention are equipped with RoPE and have their receptive fields restricted by a window. This makes the inference cost constant, and optimization based on block parallelism can also improve training speed. As for the final layer of Attention, it retains the global form but removes position encoding (NoPE) and adds \log n scaling. After these modifications and an appropriate choice of window, the model’s training performance only slightly decreases while exhibiting excellent length extrapolation capabilities.
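The two ingredients above can be sketched in a few lines. This is a rough illustration, not the author's code: the windowed causal mask for the first L-1 layers, and one common form of the \log n scaling for the final Full Attention layer (the exact scaling formula and `train_len` default are assumptions):

```python
import numpy as np

def window_causal_mask(seq_len, window):
    # Causal mask for the first L-1 layers: position i attends only to
    # the `window` most recent positions [i - window + 1, i].
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def log_n_scale(seq_len, train_len=512):
    # One common form of "log n" scaling for the final Full Attention
    # layer: scale query i by log(i + 1) / log(train_len), floored at 1,
    # so attention logits keep growing slowly past the training length.
    n = np.arange(1, seq_len + 1)
    return np.maximum(np.log(n) / np.log(train_len), 1.0)
```

With this mask, each of the first L-1 layers has a constant per-token inference cost, since only `window` keys ever need to be cached.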
Coincidentally, Google later proposed FOT (Focused Transformer), which shares many similarities with HWFA: it also uses L-1 layers of Local Attention plus 1 layer of Full Attention, and the Full Attention is also NoPE. The difference is that FOT places the Full Attention in the middle, and the Local Attention does not strictly limit the receptive field, so it cannot extrapolate length directly. Therefore, it proposed cross-batch training to extend the model length. Subsequently, I experimented with using cross-batch training on HWFA and achieved good results.
New Insights
Returning to the theme of this article, how can HWFA and ReRoPE be "joined in a powerful alliance"? We know that ReRoPE is used in Full RoPE Attention by truncating the relative position matrix during the inference phase:
\begin{pmatrix}0 & \\ 1 & 0 & \\ 2 & 1 & 0 &\\ \ddots & 2 & 1 & 0 & \\ \ddots & \ddots & 2 & 1 & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \small{L - 2} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \small{L - 1} & \small{L - 2} & \ddots & \ddots & \ddots & 2 & 1 & 0 & \\ \end{pmatrix} \,\to\, \begin{pmatrix} \color{red}{0} & \\ \color{red}{1} & \color{red}{0} & \\ \color{red}{\ddots} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{w} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \\ \color{green}{w} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w} & \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{1} & \color{red}{0} & \\ \end{pmatrix}
Surprisingly, such post-processing demonstrates excellent length extrapolation capabilities. However, due to the particularity of RoPE, the original ReRoPE implementation requires calculating the Attention matrix twice and is incompatible with mainstream Flash Attention acceleration. Overall, the increase in inference cost is somewhat significant.
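The clipped relative-position matrix shown above is easy to state in code; a minimal sketch (the function name is mine, not from the original implementation):

```python
import numpy as np

def rerope_positions(seq_len, w):
    # Relative position matrix p[i, j] = i - j, clipped at w as in ReRoPE:
    # every relative distance beyond the window w is mapped to w itself.
    # (Entries above the diagonal are negative here; in practice they are
    # removed by the causal mask anyway.)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.minimum(i - j, w)
```

Because RoPE encodes position by rotating q/k rather than by adding a bias, this clipped matrix cannot be realized with a single rotated q/k pair; the original implementation computes the attention scores twice (once for the in-window region, once for the clipped region) and merges them with the mask `i - j >= w`, which is the source of the extra inference cost and the incompatibility with Flash Attention noted above.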
However, the addition of HWFA will greatly alleviate this problem! As mentioned, ReRoPE is only used for Full RoPE Attention, while HWFA mostly consists of Window RoPE Attention. Thus, the "HWFA+ReRoPE" scheme emerges naturally: during the training phase, replace the original Full NoPE Attention in HWFA with Full RoPE Attention, and then switch to Full ReRoPE Attention during the inference phase. In this way, the additional cost brought by switching to ReRoPE during inference becomes very small, and the benefits brought by switching other layers to Window Attention are even more significant.
In addition, "HWFA+ReRoPE" can compensate for the performance loss of the original HWFA. Previously, to ensure length extrapolation capability, the Full Attention in HWFA had to remove position encoding (NoPE), and the receptive field \tilde{w} of Window Attention had to satisfy (\tilde{w}-1)(L-1)+1 = \alpha N (where L is the number of layers, N is the training length, and 0 < \alpha \leq 1). These constraints limited the model’s expressive power, leading to degraded training performance. With the introduction of ReRoPE, the receptive field of Window Attention can be appropriately larger, Full Attention can use RoPE, and it can be placed in intermediate layers rather than just the last layer. One could even use more than one layer of Full Attention. These changes can compensate for performance loss, and thanks to ReRoPE, the length extrapolation capability will not decrease.
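To make the constraint concrete, a quick check with the experimental setting used later in this article (N = 512, L = 24):

```python
# Original HWFA constraint: (w - 1) * (L - 1) + 1 = alpha * N, 0 < alpha <= 1.
# With N = 512 and L = 24, the largest admissible window receptive field is:
N, L = 512, 24
max_w = (N - 1) // (L - 1) + 1
print(max_w)  # 23 -- which is why the original HWFA experiments used w_tilde = 16
```

So without ReRoPE the window is capped at roughly two dozen tokens, whereas the experiments below push \tilde{w} up to 256.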
To distinguish it from the original version of HWFA, we can also call the "HWFA+ReRoPE" combination "HWFA2."
Experiments
Below are some experimental results for "HWFA+ReRoPE (HWFA2)." Since introducing ReRoPE gives HWFA far more flexibility, exhaustively verifying every permutation is impractical; I have therefore only experimented with combinations I consider intuitive.
The experimental model is the same as the previous HWFA and ReRoPE experiments: a GAU model with 100 million parameters and a training length of 512. Note that there are two window parameters here: one is the w parameter inherent to ReRoPE (previous ReRoPE experiments showed this has little impact, so it is uniformly set to 256 below); the other is the receptive field of HWFA’s Window Attention, denoted as \tilde{w} above, which is adjustable. Therefore, the main parameters of "HWFA+ReRoPE" are the \tilde{w} of Window Attention, and the number and distribution of Full Attention layers. My previous comparative experiments showed that, from a training perspective, placing Full Attention in the middle is better than at the end. Thus, if there is 1 layer of Full Attention, its default position is the layer at (index = num_layers / 2); if there are 2 layers, the default positions are (index = num_layers / 3) and (index = 2 * num_layers / 3), and so on.
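The placement rule described above is simply even spacing of the Full Attention layers; a sketch, where `full_attention_layers` is a hypothetical helper name:

```python
def full_attention_layers(num_layers, num_full):
    # Evenly space num_full Full Attention layers among num_layers:
    # num_full=1 -> the middle layer; num_full=2 -> the one-third and
    # two-thirds layers, matching the defaults described above.
    return [(k * num_layers) // (num_full + 1) for k in range(1, num_full + 1)]

print(full_attention_layers(24, 1))  # [12]
print(full_attention_layers(24, 2))  # [8, 16]
```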
Some experimental results are as follows:
| Method \ Test Length | 512 (Train) | 4096 (Repeated) | 4096 (Non-repeated) |
|---|---|---|---|
| Baseline | 49.41% | 24.17% | 23.16% |
| Baseline-\log n | 49.40% | 24.60% | 24.02% |
| ReRoPE-w256 | 49.41% | 77.90% | 48.48% |
| ReRoPE-w256-\log n^{\dagger} | 49.41% | 82.40% | 48.85% |
| ReRoPE-w256-\log n | 49.40% | 85.12% | 49.07% |
| InvLeaky ReRoPE-w128-\log n | 49.38% | 82.25% | 48.32% |
| InvLeaky ReRoPE-w128-b8-\log n | 49.62% | 81.15% | 48.85% |
| HWFA | 48.70% | 80.84% | 48.15% |
| HWFA-ReRoPE-w32-f1 | 49.29% | 83.13% | 49.34% |
| HWFA-ReRoPE-w64-f1 | 49.32% | 82.41% | 49.37% |
| HWFA-ReRoPE-w128-f1 | 49.21% | 80.18% | 48.99% |
| HWFA-ReRoPE-w256-f1 | 49.00% | 54.94% | 47.64% |
| HWFA-ReRoPE-w32-f2 | 49.50% | 84.09% | 49.35% |
| HWFA-ReRoPE-w64-f2 | 49.46% | 84.43% | 49.36% |
| HWFA-ReRoPE-w128-f2 | 49.35% | 83.09% | 48.97% |
| HWFA-ReRoPE-w256-f2 | 49.37% | 75.24% | 48.42% |
In the table above, the number after "w" is the receptive field \tilde{w} of the Window Attention, and the number after "f" is the number of Full Attention layers. In the original HWFA, due to its various constraints, \tilde{w} could only be taken as 16; any larger and the length extrapolation capability would drop significantly. As the table shows, after increasing \tilde{w}, the training performance quickly catches up with the baseline, and adding more Full Attention layers even surpasses it. As for extrapolation, the w32 and w64 settings are both quite good, significantly exceeding the original HWFA. Overall, the best combination for HWFA-ReRoPE is w64-f2, whose training performance and non-repeated extrapolation performance both exceed the original ReRoPE. Considering that the training length N is 512 and the number of layers L is 24, I speculate that the optimal \tilde{w} is around 2 \sim 4 times N/L.
Summary
This article proposes a method for combining HWFA and ReRoPE. Small-scale experimental results show that this combination can achieve near-optimal length extrapolation effects without sacrificing training performance. Furthermore, thanks to the design of HWFA, it can significantly reduce inference costs, effectively mitigating the disadvantage of increased inference overhead in the original ReRoPE.
Reprinted from: https://kexue.fm/archives/9731