A Theoretical Flaw and Countermeasure of Transformers with Relative Position Encoding · English (unofficial) translations of posts at kexue.fm

Position encoding is a crucial component of the Transformer architecture. In "Position Encodings in Transformers that Make Researchers Rack Their Brains", we summarized several common designs for position encoding. Broadly speaking, we categorize Transformer position encodings into two types: "absolute position encoding" and "relative position encoding." Among these, "relative position encoding" generally performs better in many NLP and CV experiments.

However, it is observable that almost all current relative position encodings operate on the Attention matrix before the Softmax operation. This method of application actually harbors a theoretical flaw that prevents the Transformer from becoming a "universal approximator." This article analyzes this issue and explores potential solutions.

A Simple Probe

As the name suggests, position encoding is used to supplement the model with positional information. So, how can we determine if a model has sufficient capability to recognize positions? I previously conceived a simple probe experiment:

For a model with the capability to recognize positions, it should be able to accurately implement the following mapping: \begin{array}{lc} \text{Input:} & [0, 0, \cdots, 0, 0] \\ & \downarrow \\ \text{Output:} & [1, 2, \cdots, n-1, n] \end{array}

In other words, given n zeros as input, the model should be able to output the position indices 1 \sim n in order. The idea behind this probe is simple: if a model can achieve this, it indicates that position recognition is an inherent capability of the model itself, independent of external input, which is exactly what we want. It is easy to see that absolute position encoding, being directly applied to the input, can easily pass this probe test.

Incapable of the Task

However, when I considered Transformer models with relative position encoding using this simple probe, I found that almost none of them could complete the task.

Specifically, except for the design proposed in "Self-Attention with Relative Position Representations", all other relative position encodings (including RoPE, which I proposed) only modify the Attention matrix before Softmax. Consequently, the Attention matrix containing relative position information remains a probability matrix (i.e., each row sums to 1).

On the other hand, for a Transformer model, the only source of interaction between tokens is the \boldsymbol{A}\boldsymbol{V} step of Self-Attention, which can be written as \boldsymbol{o}_i = \sum_j a_{i,j}\boldsymbol{v}_j. Identical inputs mean that every \boldsymbol{v}_j is identical, so: \boldsymbol{o}_i = \sum_j a_{i,j}\boldsymbol{v}_j = \sum_j a_{i,j}\boldsymbol{v} = \left(\sum_j a_{i,j}\right)\boldsymbol{v} = \boldsymbol{v} This means every \boldsymbol{o}_i is also identical. In other words, every position in the model outputs the same result from start to finish, making it impossible for the model to output distinct values like [1, 2, \cdots, n-1, n].

Similar findings appeared in the recent paper "Your Transformer May Not be as Powerful as You Expect". The authors constructed slightly different examples to demonstrate the flaw in the fitting capability of Transformers with relative position encoding; the two approaches are different tunes played with equal skill. Furthermore, I mentioned "universal approximation" at the beginning—does solving this counterexample guarantee universal approximation? The cited paper provides corresponding theoretical analysis to confirm this fact, which I will not detail here.

Preliminary Solution

A little reflection reveals that the problem mainly stems from the fact that each row of the Attention matrix sums to 1. To solve this, we need to find a way to break this constraint. To this end, "Your Transformer May Not be as Powerful as You Expect" further proposed the following design: \boldsymbol{O} = (\boldsymbol{A} \odot \boldsymbol{C})\boldsymbol{V} \quad \text{or equivalently} \quad \boldsymbol{o}_i = \sum_j a_{i,j}c_{i,j}\boldsymbol{v}_j where \boldsymbol{C} is a trainable parameter matrix and \odot denotes element-wise multiplication (Hadamard product). To ensure the entire model still only contains relative position information (as we are discussing the flaws of relative position encoding Transformers), we must constrain \boldsymbol{C} to be a Toeplitz matrix, i.e., c_{i,j} = g(i-j).

With the addition of \boldsymbol{C}, the sum of each row in \boldsymbol{A} \odot \boldsymbol{C} as a whole is not necessarily 1, thereby breaking the restriction and solving the problem (for more experimental results, please refer to the original paper). However, this introduces a new parameter matrix. Since \boldsymbol{C} itself is of finite size, it does not naturally support variable-length inputs (or the matrix \boldsymbol{C} must be truncated accordingly, e.g., in the form c_{i,j} = g(\text{clip}(i-j, p_{\min}, p_{\max}))), which overall feels less concise and elegant.

Removing the Denominator

Returning to the root of the problem: the rows of the Attention matrix sum to 1. What operation leads to this phenomenon? The answer is obviously Softmax: a_{i,j} = \frac{e^{b_{i,j}}}{\sum_j e^{b_{i,j}}} where \boldsymbol{B} = (b_{i,j}) is the matrix before Softmax. Clearly, the step of "dividing by \sum_j e^{b_{i,j}}" leads to \sum_j a_{i,j} = 1. A very direct thought is:

If I don’t want \sum_j a_{i,j} = 1, why not just stop dividing by \sum_j e^{b_{i,j}}?

In fact, this works! Experimental results show that a Transformer that does not divide by this denominator can indeed successfully complete the aforementioned probe test. At this point, one cannot help but admire the "foresight" of the GAU (Gated Attention Unit). Its proposed new Attention mechanism uses \text{relu}^2 activation and then simply divides by n for normalization, avoiding \sum_j a_{i,j} = 1 and thereby enhancing the model’s theoretical capability (though perhaps the authors didn’t think that far ahead, and this is largely my own interpretation).

New Normalization

However, we noted in "It Seems Attention and Softmax are a Better Match" that Attention designs like those in GAU, which do not perform probability normalization, might suffer from poor extrapolation capabilities. That is to say, while probability normalization causes the theoretical flaw mentioned earlier, simply dividing by n for normalization might lead to poor extrapolation. Is there a solution that can balance both?

Let’s brainstorm further. From the perspective of norms, \sum_j e^{b_{i,j}} is actually the l_1 norm of the vector e^{b_{i,:}}. Thus, Softmax is essentially an l_1 normalization of the vector e^{b_{i,:}}. To avoid \sum_j a_{i,j} = 1 while retaining normalization, could we switch to other normalization operations? For example, l_2 normalization: a_{i,j} = \frac{e^{b_{i,j}}}{\sqrt{\sum_j e^{2b_{i,j}}}}

Based on my tests, this l_2-normalized Attention can indeed successfully complete the probe experiment. So, does this change help in the NLP pre-training scenarios we care more about? I conducted corresponding comparative experiments, and the results are twofold:

For the standard Attention + FFN combination, before applying l_2-normalized Attention, one needs to reduce the initial variance of the Attention’s \boldsymbol{W}_V and \boldsymbol{W}_O. The experimental results were slightly worse than conventional l_1-normalized Attention.
For the full GAU architecture, l_2-normalized Attention can be applied directly without changing the initialization. The experimental results were slightly better than conventional l_1-normalized Attention.

The difference likely stems from their inherent initialization methods. In the standard Attention + FFN combination, the initial Attention matrix is close to a uniform matrix (every value is the same), whereas in "Does the Gated Attention Unit (GAU) Still Need Warmup?", we analyzed that the GAU’s initial Attention matrix is closer to (a multiple of) the identity matrix.

A Turn of Events

Looking back at the entire discussion, we found that because "every \boldsymbol{v}_j is identical," the "model where \sum_j a_{i,j} = 1 cannot complete the probe experiment." But what if the \boldsymbol{v}_j are not all identical?

We know that since BERT, mainstream Transformer models have designed inputs like "[CLS] SENT [SEP]", meaning marker tokens are attached before and after the input. If we treat these marker tokens as part of the model rather than the input (i.e., the input is "[CLS] 0 0 \cdots 0 0 [SEP]" rather than all zeros), is it possible to complete the probe?

I also experimented with this and found that after supplementing the input with marker tokens, the probe experiment could be completed without any modifications to other parts of the relative position encoding Transformer. This result is somewhat ironic—it turns out the authors of BERT also had great "foresight." The added special tokens [CLS] and [SEP] also serve an auxiliary role in positioning. The theoretical flaw we analyzed for so long was actually solved by two special tokens. This reminds one of the conclusion in "How Much Position Information Do Convolutional Neural Networks Encode?" that "CNNs recognize absolute positions through padding"; there is a certain commonality between the two.

Of course, this does not mean our previous reflections were meaningless. For example, for the GAU model, switching Attention to l_2 normalization indeed has the effect of accelerating convergence and slightly improving performance. Furthermore, since l_2 normalization is acceptable, could e^{b_{i,j}} be replaced with a general activation function (e.g., removing the non-negativity constraint)? I briefly experimented with "\text{swish}(b_{i,j}) + l_2 normalization" and found it to be somewhat feasible. From this perspective, Attention under l_2 normalization actually offers more room for expansion.

Conclusion

This article analyzed an implicit flaw in Transformers with relative position encoding and explored corresponding countermeasures, leading to reflections on the non-negativity and normalization methods of the Attention matrix.

Reprinted from: https://kexue.fm/archives/9105

For more details regarding reprinting, please refer to: "Scientific Space FAQ"