In the posts “Looking at the Scale Operation of Attention from Entropy Invariance” and “A Quick Derivation of Entropy-Invariant Softmax”, the author proposed the Entropy-Invariant Softmax. Simply put, this multiplies the attention scores by an additional factor of \log n before the Softmax operation, where n is the sequence length; in theory, this helps enhance length extrapolation. The factor \log n reminded the author of the JL Lemma (Johnson-Lindenstrauss Lemma), because the JL Lemma tells us that encoding n vectors requires only \mathcal{O}(\log n) dimensions. Since both involve \log n, is there a connection between the two?
Entropy Invariance
We know that entropy is a measure of uncertainty. In the context of the attention mechanism, we use it as a measure of the “degree of concentration of attention.” So-called entropy invariance means that regardless of the sequence length n, we want attention to stay concentrated on a few key tokens rather than being spread too thinly. To this end, the proposed form of Entropy-Invariant Attention is: \begin{equation} Attention(Q,K,V) = \text{softmax}\left(\frac{\log_{512} n}{\sqrt{d}}QK^{\top}\right)V \label{eq:core} \end{equation} Here Q,K \in \mathbb{R}^{n\times d}. Compared to conventional Attention, the scaling factor is multiplied by an additional \log_{512} n, where the base 512 is chosen under the assumption that all our hyperparameters (such as d) are tuned for a training length of 512. Even if your planned pre-training length is not 512, you can still choose 512 as the base; the result is essentially unaffected.
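The formula above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation; the function name `entropy_invariant_attention` and the `train_len` parameter are our own choices for this sketch.

```python
import numpy as np

def entropy_invariant_attention(Q, K, V, train_len=512):
    """Scaled-dot attention with the extra log_{train_len}(n) factor.

    Q, K have shape (n, d); V has shape (n, d_v). `train_len` is the
    length the hyperparameters are assumed to be tuned for (512 here).
    """
    n, d = Q.shape
    # log_{512}(n) / sqrt(d), the scale factor from the equation above
    scale = (np.log(n) / np.log(train_len)) / np.sqrt(d)
    scores = scale * (Q @ K.T)
    # numerically stable softmax over the last axis
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V
```

At n = 512 the extra factor equals 1, so this reduces exactly to conventional scaled-dot attention.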
The principle of this form is also quite intuitive. When n increases, it means more tokens are sharing the attention, leading to a lack of concentration. At this point, we multiply by a factor that is monotonically increasing with respect to n. After the softmax, this effectively acts as a power operation on the original probabilities. Since probabilities are all less than 1, smaller probabilities become even smaller after the power operation, thus making the attention concentrated again. As for why this factor takes a logarithmic form, one would need to refer to the derivation process in the articles mentioned at the beginning.
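The “power operation” intuition above is easy to verify numerically: multiplying the logits by a factor \alpha > 1 before the softmax is identical to raising the original probabilities to the power \alpha and renormalizing, and it lowers the entropy. A small sketch (the helper names `softmax` and `entropy` are defined here for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p)).sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])
p = softmax(logits)
alpha = 1.5  # plays the role of log_512(n) for some n > 512

# scaling logits by alpha == raising probabilities to power alpha, renormalized
p_scaled = softmax(alpha * logits)
p_power = p**alpha / (p**alpha).sum()
assert np.allclose(p_scaled, p_power)

# smaller probabilities shrink faster, so attention re-concentrates
assert entropy(p_scaled) < entropy(p)
```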
JL Lemma
The JL Lemma, full name “Johnson-Lindenstrauss Lemma,” is an important conclusion regarding vector embedding. Simply put, it tells us that “to accommodate n vectors, only an \mathcal{O}(\log n)-dimensional space is needed” (here \log denotes the natural logarithm, base e). For a detailed introduction, please refer to “The Amazing Johnson-Lindenstrauss Lemma: Theory Part”.
Interestingly, long before the author knew about the JL Lemma, a similar and even more specific result was derived in “The Principle of Minimum Entropy (VI): How to Choose the Dimension of Word Vectors?”—embedding n word vectors roughly requires 8\log n dimensions. This estimate is very close to the dimensions used in practice. For example, when n equals 100,000, 8\log n is approximately 92, while the word vector dimensions we frequently use are also in the range of one or two hundred.
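The arithmetic behind that estimate is a one-liner; here is a quick check of the n = 100{,}000 figure (the function name `embedding_dim_estimate` is our own):

```python
import math

def embedding_dim_estimate(n):
    """Minimum-entropy estimate from the cited post: ~8 ln(n) dims for n vectors."""
    return 8 * math.log(n)

print(round(embedding_dim_estimate(100_000)))  # prints 92
```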
Additionally, the JL Lemma can be used to explain the multi-head nature of the attention mechanism. If we substitute n=512, then 8\log n \approx 50, which is very close to the commonly used projection dimension for Q and K in Attention (i.e., key_size, which is 64 in BERT; refer to here). This tells us that if the sequence length is 512, a dimension of around 50 for the Q and K of Attention is sufficient; there is no need to use the full hidden_size (which is 768 in BERT-base). The saved dimensions can instead be used for multi-head attention.
For more related discussions, please refer to “Analysis of the Usability of the Dimension Formula ‘n > 8.33 log N’” and “The Amazing Johnson-Lindenstrauss Lemma: Application Part”.
Making the Connection
Now, we can attempt to link the JL Lemma with Entropy-Invariant Attention.
Let the key_size of Q and K be d. The JL Lemma tells us that the optimal choice for d should be d_n = \lambda \log n, where \lambda is a proportionality constant whose specific value is not important. That is to say, in an ideal scenario, d should change as n changes. However, such a design is obviously not easy to implement and is not conducive to parallelizing the computation. Therefore, in practice, we can only use a fixed d.
Suppose we have selected a fixed d, and assume this d is designed for a training length of 512. Then we can deduce d = \lambda \log 512, which means \lambda = \frac{d}{\log 512}, and: \begin{equation} d_n = \frac{d}{\log 512}\log n = d \log_{512} n \end{equation} For n \neq 512, ideally, a projection dimension of d_n should be used, but in practice, d dimensions are used. According to the definition of the inner product \langle q,k \rangle = \sum_{i=1}^d q_i k_i, the number of terms in the sum is exactly equal to the dimension d. That is to say, ideally there should be d_n terms in the sum, but in reality, it becomes d terms. Intuitively, if the contribution of each term is similar, then after multiplying the result by \frac{d_n}{d}, we can make the result closer to the ideal case of a d_n-term sum. Therefore, we conclude that we should multiply \langle q,k \rangle by the factor: \begin{equation} \frac{d_n}{d} = \log_{512} n \end{equation} to compensate for the gap between the actual situation and the ideal situation. When the conventional Scaled-Dot Attention is multiplied by \log_{512} n, it becomes exactly the Entropy-Invariant Attention, which is Equation [eq:core].
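The compensation factor derived above is simple to compute and check; a small sketch (the function name `jl_scale_factor` is our own):

```python
import math

def jl_scale_factor(n, train_len=512):
    """d_n / d = log_{train_len}(n): compensates for using a fixed d
    in place of the ideal d_n = lambda * log(n)."""
    return math.log(n) / math.log(train_len)

# at the training length the factor is exactly 1: nothing changes
assert jl_scale_factor(512) == 1.0

# for longer sequences the scores are amplified, re-concentrating attention
print(jl_scale_factor(4096))  # log_512(4096) = 12/9 = 4/3 ≈ 1.333
```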
In this way, we have established a connection between the JL Lemma and Entropy-Invariant Attention. Note that this is only an intuitive and qualitative understanding process. It is difficult to further formalize it from a quantitative perspective, and in fact, there is no need for further quantification, as the JL Lemma itself is mostly a qualitative conclusion.
Summary
This article constructs a simple connection between the JL Lemma and Entropy-Invariant Attention.