In the article "Attention Scaling from the Perspective of Entropy Invariance", we derived a version of the attention mechanism with the entropy-invariance property:
\begin{equation}
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{\kappa \log n}{d}QK^{\top}\right)V \label{eq:a}
\end{equation}
The key change is the length-dependent scaling factor $\log n$ introduced inside the Softmax. The original derivation was rather tedious and relied on many assumptions, which made it hard to understand intuitively. This article gives a more concise derivation.
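As a concrete reading of the formula above, here is a minimal sketch in plain Python (standard library only, to stay self-contained; a real implementation would use an array library). The function name `log_scaled_attention` is hypothetical, `n` is assumed to be the number of keys that the Softmax normalizes over, and the default `kappa=1.0` is only a placeholder, not a value taken from the article.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def log_scaled_attention(Q, K, V, kappa=1.0):
    """softmax(kappa * log(n) / d * Q K^T) V for lists of row vectors.

    Assumptions: n is the number of keys, d is the key dimension,
    and kappa is a free hyperparameter (placeholder default).
    """
    n = len(K)
    d = len(K[0])
    scale = kappa * math.log(n) / d
    out = []
    for q in Q:
        # Scaled inner product of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out
```

Note that with a single key ($n = 1$) the scale becomes zero, which is harmless here because the Softmax over one element is always 1.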
Derivation Process
We can set aside the context of the attention mechanism and simply assume $s_1, s_2, \dots, s_n \in \mathbb{R}$. Define
\begin{equation}
p_i = \frac{e^{\lambda s_i}}{\sum_{j=1}^n e^{\lambda s_j}}
\end{equation}
Clearly, this is the result of applying Softmax to $s_1, s_2, \dots, s_n$ after multiplying them by a scaling factor $\lambda$. Now we compute its entropy:
\begin{equation}
\begin{aligned}
H &= -\sum_{i=1}^n p_i \log p_i = \log\sum_{i=1}^n e^{\lambda s_i} - \lambda\sum_{i=1}^n p_i s_i \\
&= \log n + \log\frac{1}{n}\sum_{i=1}^n e^{\lambda s_i} - \lambda\sum_{i=1}^n p_i s_i
\end{aligned}
\end{equation}
The term inside the $\log$ in the second line is "exponentiate then average." We approximate it by "average then exponentiate" (a mean-field approximation):
\begin{equation}
\log\frac{1}{n}\sum_{i=1}^n e^{\lambda s_i} \approx \log\exp\left(\frac{1}{n}\sum_{i=1}^n \lambda s_i\right) = \lambda \bar{s}
\end{equation}
where $\bar{s}$ denotes the mean of $s_1, s_2, \dots, s_n$. Furthermore, we know that Softmax tends to concentrate on the maximum value (see "Notes on Function Smoothing: Differentiable Approximation of Non-differentiable Functions"), so we have the approximation
\begin{equation}
\lambda\sum_{i=1}^n p_i s_i \approx \lambda s_{\max}
\end{equation}
Therefore
\begin{equation}
H \approx \log n - \lambda(s_{\max} - \bar{s})
\end{equation}
Entropy invariance aims to eliminate the influence of the length $n$ as far as possible. Assuming the gap $s_{\max} - \bar{s}$ does not itself depend strongly on $n$, the formula above requires $\lambda \propto \log n$, so that the second term can offset the growth of $\log n$. When this is applied to the attention mechanism, each $s$ is an inner product $\langle \boldsymbol{q}, \boldsymbol{k} \rangle \propto d$ (where $d$ is the vector dimension), so we also need $\lambda \propto \frac{1}{d}$. Combining the two requirements gives
\begin{equation}
\lambda \propto \frac{\log n}{d}
\end{equation}
This is the result shown in equation \eqref{eq:a} at the beginning of the article.
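The steps of this derivation can be checked numerically. The sketch below (plain Python; all function names are hypothetical) computes the entropy by definition, by the exact identity $H = \log\sum_i e^{\lambda s_i} - \lambda\sum_i p_i s_i$, and by the final estimate $\log n - \lambda(s_{\max} - \bar{s})$. It assumes $\lambda \ge 0$. Under that assumption both approximations err in the same direction (Jensen's inequality for the first, $\sum_i p_i s_i \le s_{\max}$ for the second), so the estimate is a lower bound on the exact entropy and is best read as a trend rather than a precise value. The uniformly sampled scores in the demo loop are synthetic and chosen only for illustration.

```python
import math
import random

def softmax(scores, lam):
    """Softmax of lam * scores, shifted by the max for stability (lam >= 0)."""
    m = max(scores)
    exps = [math.exp(lam * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_by_definition(scores, lam):
    """H = -sum_i p_i log p_i, skipping terms where p_i underflows to 0."""
    return -sum(p * math.log(p) for p in softmax(scores, lam) if p > 0.0)

def entropy_by_identity(scores, lam):
    """H = log sum_i exp(lam * s_i) - lam * sum_i p_i s_i (exact identity)."""
    m = max(scores)
    log_z = lam * m + math.log(sum(math.exp(lam * (s - m)) for s in scores))
    p = softmax(scores, lam)
    return log_z - lam * sum(pi * si for pi, si in zip(p, scores))

def entropy_estimate(scores, lam):
    """The approximation H ~ log n - lam * (s_max - s_mean)."""
    n = len(scores)
    return math.log(n) - lam * (max(scores) - sum(scores) / n)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for n in (16, 256, 4096):
        # Synthetic scores, an assumption made purely for this demo.
        scores = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        lam = math.log(n)  # the lambda ∝ log n choice, with constant 1
        print(n,
              entropy_by_definition(scores, lam),
              entropy_by_identity(scores, lam),
              entropy_estimate(scores, lam))
```

Replacing `lam = math.log(n)` with a constant lets you compare how the exact entropy changes with $n$ under fixed and length-dependent scaling for a score distribution of your choice.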
Summary
This article gives a simple and clear derivation of the previously proposed "Entropy-Invariant Softmax."
Original Address: https://kexue.fm/archives/9034