English (unofficial) translations of posts at kexue.fm

Transformer Upgrade Road: 11. Pushing \beta-base Positional Encoding to the Limit

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

In the article "Transformer Upgrade Road: 10. RoPE is a \beta-base Encoding", we provided a \beta-base interpretation of RoPE and, based on the idea of base conversion, derived NTK-aware Scaled RoPE, which can extend the context length without fine-tuning. It must be said that understanding positional encoding through the analogy of \beta-base representation is a very beautiful and enlightening perspective, so much so that every time I think deeply about it, I seem to come away with new insights.

This article will revisit the \beta-base interpretation of RoPE and attempt to generalize the existing NTK-aware Scaled RoPE, in hopes of finding a superior strategy to extend the context length of LLMs without fine-tuning.

Base Analogy

We know that the parameterization of RoPE follows the form of Sinusoidal positional encoding. Whether by coincidence or design, the Sinusoidal positional encoding of an integer n shares many similarities with its \beta-base encoding.

Specifically, the m-th digit (counting from right to left) of the \beta-base representation of an integer n is: \begin{equation} \left\lfloor\frac{n}{\beta^{m-1}}\right\rfloor \bmod \beta \label{eq:mod} \end{equation} While its Sinusoidal positional encoding is: \begin{equation} \begin{aligned} \boldsymbol{p}_n &= \big[\cos\theta_1, \sin\theta_1, \cos\theta_2, \sin\theta_2, \cdots, \cos\theta_{d/2}, \sin\theta_{d/2}\big] \\ \theta_m &= \frac{n}{\beta^{m-1}}, \quad \beta=10000^{2/d} \end{aligned} \label{eq:sinu} \end{equation} As can be seen, both contain the same term \frac{n}{\beta^{m-1}}, and both \bmod and \cos, \sin are periodic functions. The only difference between the two is the insignificant floor operation \lfloor\cdot\rfloor. Therefore, analogizing RoPE/Sinusoidal positional encoding to its \beta-base representation is a very intuitive and reasonable result.
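The analogy can be made concrete with a few lines of code. Below is a minimal sketch (the function names are mine, not from the post) computing both the \beta-base digits of Equation [eq:mod] and the sinusoidal encoding of Equation [eq:sinu]:

```python
import numpy as np

def beta_base_digits(n, beta, num_digits):
    """m-th digit (1-indexed, right to left) of n in base beta:
    floor(n / beta^(m-1)) mod beta, as in Equation [eq:mod]."""
    return [(n // beta ** (m - 1)) % beta for m in range(1, num_digits + 1)]

def sinusoidal_encoding(n, d):
    """Sinusoidal positional encoding of n with dimension d, beta = 10000^(2/d)."""
    beta = 10000 ** (2 / d)
    theta = n / beta ** np.arange(d // 2)  # theta_m = n / beta^(m-1)
    # interleave cos/sin pairs: [cos(theta_1), sin(theta_1), cos(theta_2), ...]
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1).reshape(-1)

print(beta_base_digits(2023, 10, 4))  # [3, 2, 0, 2] (digits of 2023, low to high)
```

Both functions share the term n / beta^(m-1); the digit version wraps it in floor and mod, the sinusoidal version in the periodic cos/sin pair.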

Correcting NTK

Following the logic in "Transformer Upgrade Road: 10. RoPE is a \beta-base Encoding", direct extrapolation concentrates the pressure on the "high bits" (where m is larger), while positional interpolation makes the representation of "low bits" (where m is smaller) denser, which is detrimental to distinguishing relative distances. NTK-aware Scaled RoPE is essentially a base conversion that spreads the extrapolation pressure across every bit while maintaining the interval between adjacent positions. These characteristics are very friendly and critical for LLMs, which tend to rely heavily on relative positions. Thus, it can achieve certain effects even without fine-tuning.

Looking closely at Equation [eq:sinu], \cos and \sin actually form a single unit, so there are effectively only d/2 bits. This means it is equivalent to a d/2-digit \beta-base encoding of n. If we want to extend the context length by k times by converting the \beta-base to a \beta\lambda-base, we should have at least: \begin{equation} \lambda^{d/2} = k \quad \Rightarrow \quad \lambda = k^{2/d} \end{equation} Thus, the new RoPE becomes: \begin{equation} \begin{aligned} \boldsymbol{p}_n &= \big[\cos\theta_1, \sin\theta_1, \cos\theta_2, \sin\theta_2, \cdots, \cos\theta_{d/2}, \sin\theta_{d/2}\big] \\ \theta_m &= \frac{n}{(\beta\lambda)^{m-1}}, \quad \beta=10000^{2/d}, \quad \lambda = k^{2/d} \end{aligned} \label{eq:ntk-old} \end{equation} This is the NTK-RoPE we proposed in the previous article.

However, after further reflection, I realized this is not entirely reasonable. Returning to Equation [eq:mod], if we want to calculate the m-th digit of a \beta\lambda-base, it should be: \begin{equation} \left\lfloor\frac{n}{(\beta\lambda)^{m-1}}\right\rfloor \bmod (\beta\lambda) \end{equation} That is to say, in addition to replacing \frac{n}{\beta^{m-1}} with \frac{n}{(\beta\lambda)^{m-1}}, the period of the \bmod operation should also be expanded by \lambda times. This is equivalent to dividing by an additional \lambda before calculating \cos and \sin: \begin{equation} \begin{aligned} \boldsymbol{p}_n &= \big[\cos\theta_1, \sin\theta_1, \cos\theta_2, \sin\theta_2, \cdots, \cos\theta_{d/2}, \sin\theta_{d/2}\big] \\ \theta_m &= \frac{n}{\lambda(\beta\lambda)^{m-1}}, \quad \beta=10000^{2/d}, \quad \lambda = k^{2/d} \end{aligned} \label{eq:ntk-fixed} \end{equation} In the subsequent experiments, we refer to Equation [eq:ntk-old] proposed in the previous article as "NTK-RoPE-old" and Equation [eq:ntk-fixed] as "NTK-RoPE-fixed".
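The difference between the two variants comes down to a single extra division by \lambda. A minimal sketch (the function name and signature are mine, assuming a scalar position n):

```python
import numpy as np

def ntk_theta(n, d, k, fixed=True):
    """Angles theta_m for NTK-RoPE with extension factor k.
    old   (Eq. [eq:ntk-old]):   theta_m = n / (beta*lam)^(m-1)
    fixed (Eq. [eq:ntk-fixed]): theta_m = n / (lam * (beta*lam)^(m-1)),
    i.e. the mod period is also widened by lam."""
    beta = 10000 ** (2 / d)
    lam = k ** (2 / d)
    m = np.arange(1, d // 2 + 1)
    theta = n / (beta * lam) ** (m - 1)
    if fixed:
        theta = theta / lam  # the correction: widen the period of every digit
    return theta
```

For k = 1 both variants reduce to the original RoPE angles, as expected.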

Mixed Base

Now, let us be even more "imaginative"—since we can represent positions using a \beta-base, why not use a more generalized "mixed-base" system? A mixed-base system refers to one where the radix for each digit is not necessarily the same. This is not unfamiliar to us; for example, 60 seconds make 1 minute, 60 minutes make 1 hour, but 24 hours make 1 day, and 7 days make 1 week. Here, 60, 60, 24, and 7 are different radices. In other words, seconds, minutes, hours, days, and weeks are an example of a mixed-base system.

Assuming that from right to left, the 1st digit uses base \beta_1, the 2nd digit uses base \beta_2, the 3rd digit uses base \beta_3, and so on, then the m-th digit of n is: \begin{equation} \left\lfloor\frac{n}{\beta_1\beta_2\cdots\beta_{m-1}}\right\rfloor \bmod \beta_m \label{eq:mod2} \end{equation} Why consider a mixed-base system? This is because I discovered an interesting fact: RoPE is essentially a relative positional encoding. Relative position is a special case of a Toeplitz matrix, which looks like this (since we are primarily concerned with language models, the upper right part is omitted): \begin{equation} \begin{pmatrix} 0 & \\ 1 & 0 & \\ 2 & 1 & 0 & \\ 3 & 2 & 1 & 0 & \\ 4 & 3 & 2 & 1 & 0 & \\ 5 & 4 & 3 & 2 & 1 & 0 & \\ 6 & 5 & 4 & 3 & 2 & 1 & 0 & \\ \end{pmatrix} \end{equation} From the above matrix, we can see that the distribution of relative positions is uneven! 0 appears most frequently, followed by 1, then 2, and so on. This means that as n increases, its frequency decreases. Consequently, as a \beta-base encoding, the "high bits" of RoPE are likely to be insufficiently trained; in other words, the generalization ability of high bits is likely inferior to that of low bits. As mentioned earlier, NTK-RoPE spreads the extrapolation pressure across every bit. If this hypothesis is correct, then "equal spreading" is not optimal. Instead, low bits should bear more of the burden, and high bits less, leading to a mixed-base approach.
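The unevenness is easy to verify by counting: in an L x L causal Toeplitz matrix, relative distance r appears exactly L - r times. A quick check (variable names are mine):

```python
from collections import Counter

L = 7  # sequence length, matching the 7x7 matrix above
# count how often each relative distance i - j (j <= i) appears
counts = Counter(i - j for i in range(L) for j in range(i + 1))
print([counts[r] for r in range(L)])  # [7, 6, 5, 4, 3, 2, 1]
```

Distance 0 is the most frequent and L - 1 the rarest, which is why the high bits of the \beta-base encoding see far fewer training signals than the low bits.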

Optimization of Distribution

Specifically, we extend the context by a factor of k by converting the \beta-base into a mixed-base system with radices \beta_1, \beta_2, \cdots, \beta_{d/2}, where \beta_m = \beta \lambda_m. In this case, Equation [eq:mod2] becomes: \begin{equation} \left\lfloor\frac{n}{\beta^{m-1}(\lambda_1\lambda_2\cdots\lambda_{m-1})}\right\rfloor \bmod (\beta\lambda_m) \end{equation} Equation [eq:ntk-fixed] correspondingly becomes: \begin{equation} \begin{aligned} \boldsymbol{p}_n &= \big[\cos\theta_1, \sin\theta_1, \cos\theta_2, \sin\theta_2, \cdots, \cos\theta_{d/2}, \sin\theta_{d/2}\big] \\ \theta_m &= \frac{n}{\beta^{m-1}(\lambda_1\lambda_2\cdots\lambda_m)}, \quad \beta=10000^{2/d} \end{aligned} \end{equation} Based on the principles of "extending by k times" and "low bits bearing more burden," the constraints are: \begin{equation} \lambda_1\lambda_2\cdots\lambda_{d/2} = k, \quad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{d/2} \geq 1 \end{equation} We consider a solution of the following form: \begin{equation} \lambda_1\lambda_2\cdots\lambda_m = \exp(am^b) \end{equation} When a > 0, b \leq 1, it satisfies the condition \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{d/2} \geq 1. When b=1, it reduces to the "NTK-RoPE-fixed" mentioned earlier. When b=0, it becomes Positional Interpolation (PI). The constraint \lambda_1\lambda_2\cdots\lambda_{d/2} = k gives: \begin{equation} a\left(\frac{d}{2}\right)^b = \log k \end{equation} Thus, there is only one degree of freedom to tune. Through simple binary search, I found that in my experiments, b=0.625 yields a good average extension effect (different models may have different optimal solutions). This version is called "NTK-RoPE-mixed".
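Putting the pieces together, NTK-RoPE-mixed only needs the cumulative product \lambda_1\cdots\lambda_m = \exp(am^b) with a = \log k / (d/2)^b. A minimal sketch (function name is mine):

```python
import numpy as np

def mixed_theta(n, d, k, b=0.625):
    """Angles for NTK-RoPE-mixed:
    theta_m = n / (beta^(m-1) * lambda_1 * ... * lambda_m),
    with lambda_1 * ... * lambda_m = exp(a * m^b) and a = log(k) / (d/2)^b,
    so that the full product over d/2 digits equals k."""
    beta = 10000 ** (2 / d)
    m = np.arange(1, d // 2 + 1)
    a = np.log(k) / (d / 2) ** b
    cum_lam = np.exp(a * m ** b)  # cumulative product of the lambda_i
    return n / (beta ** (m - 1) * cum_lam)
```

Setting b = 1 gives cum_lam = \lambda^m with \lambda = k^{2/d}, i.e. exactly NTK-RoPE-fixed, which is a useful sanity check.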

Experimental Results

Based on the experiments in "Transformer Upgrade Road: 10. RoPE is a \beta-base Encoding", I added experiments for "NTK-RoPE-fixed" and "NTK-RoPE-mixed". The comparison is as follows:

\begin{equation*} \begin{array}{c|ccc} \hline \text{Test Length} & 512 (\text{Train}) & 4096 (\text{Repeat}) & 4096 (\text{Non-repeat}) \\ \hline \text{Baseline} & 49.41\% & 24.17\% & 23.16\% \\ \text{Baseline-} \log n & 49.40\% & 24.60\% & 24.02\% \\ \hline \text{PI-RoPE} & 49.41\% & 15.04\% & 13.54\% \\ \text{PI-RoPE-} \log n & 49.40\% & 14.99\% & 16.51\% \\ \hline \text{NTK-RoPE-old} & 49.41\% & 51.28\% & 39.27\% \\ \text{NTK-RoPE-} \log n\text{-old} & 49.40\% & 61.71\% & 43.75\% \\ \hline \text{NTK-RoPE-fixed} & 49.41\% & 51.86\% & 39.61\% \\ \text{NTK-RoPE-} \log n\text{-fixed} & 49.40\% & 62.85\% & 44.14\% \\ \text{NTK-RoPE-mixed} & 49.41\% & 53.09\% & 40.12\% \\ \text{NTK-RoPE-} \log n\text{-mixed} & 49.40\% & \boldsymbol{68.91\%} & \boldsymbol{45.41\%} \\ \hline \end{array} \end{equation*}

As can be seen, compared to the equal-base "NTK-RoPE-old" and "NTK-RoPE-fixed", the "NTK-RoPE-mixed" derived from the mixed-base approach brings a significant improvement. Moreover, it requires no fine-tuning, making it a "free lunch." Additionally, the \log n version performs better in extrapolation. However, the \log n trick needs to be added during the pre-training phase. Some readers have asked whether models like LLaMA, which did not include the \log n trick during pre-training, can still benefit from it. My tests show that they can be improved by adding the following scale factor: \begin{equation} \max(1, \log_{\text{maxlen}} n) \label{eq:plogn} \end{equation} Here, \text{maxlen} is the maximum length during pre-training (512 in this experiment, 2048 in LLaMA, and 4096 in LLaMA2). In implementation, each \boldsymbol{q}_n can be multiplied by this factor. This way, parts within \text{maxlen} are unaffected, while parts beyond it are scaled by \log_{\text{maxlen}} n. This serves as a simple transition. The results are as follows (using \color{red}{\dagger} to distinguish from the original \log n):

\begin{equation*} \begin{array}{c|ccc} \hline \text{Test Length} & 512 (\text{Train}) & 4096 (\text{Repeat}) & 4096 (\text{Non-repeat}) \\ \hline \text{NTK-RoPE-fixed} & 49.41\% & 51.86\% & 39.61\% \\ \text{NTK-RoPE-} \log n^{\color{red}{\dagger}}\text{-fixed} & 49.41\% & 55.94\% & 41.11\% \\ \text{NTK-RoPE-mixed} & 49.41\% & 53.09\% & 40.12\% \\ \text{NTK-RoPE-} \log n^{\color{red}{\dagger}}\text{-mixed} & 49.41\% & 59.11\% & 42.38\% \\ \hline \end{array} \end{equation*}

This \log n^{\color{red}{\dagger}} can also be considered a free lunch. In summary, if you plan to pre-train from scratch, it is advisable to include the \log n trick beforehand. If training is already complete, you can use Equation [eq:plogn] as a substitute, combined with NTK-RoPE-mixed, to achieve superior context extension.
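The scale factor of Equation [eq:plogn] is straightforward to implement, using \log_{\text{maxlen}} n = \ln n / \ln \text{maxlen}. A minimal sketch (function name is mine; in practice the factor multiplies each query \boldsymbol{q}_n):

```python
import numpy as np

def logn_scale(n, maxlen):
    """Scale factor max(1, log_maxlen(n)) from Equation [eq:plogn].
    Positions within maxlen are untouched (factor 1); beyond it,
    queries are scaled by log_maxlen(n). Works on scalars or arrays."""
    n = np.maximum(n, 1)  # guard against log(0) at position 0
    return np.maximum(1.0, np.log(n) / np.log(maxlen))
```

With maxlen = 512, position 4096 gets a factor of log(4096)/log(512) = 4/3, while all positions up to 512 are left at 1.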

Summary

In this article, we revisited the \beta-base perspective of RoPE and attempted to generalize NTK-aware Scaled RoPE. Inspired by the mixed-base concept, we obtained a superior strategy for extending context length without fine-tuning and demonstrated its effectiveness through experiments.