English (unofficial) translations of posts at kexue.fm

A Quick Derivation of Entropy-Invariant Softmax

Translated by DeepSeek V4 Pro. Translations can be inaccurate; please refer to the original post for anything important.

In the article "Attention Scaling from the Perspective of Entropy Invariance", we derived a version of the attention mechanism with the entropy-invariance property:

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{\kappa \log n}{d}QK^{\top}\right)V$$

As can be seen, this is achieved mainly by introducing a length-dependent scaling factor $\log n$ into the Softmax. The original derivation was rather tedious and relied on many assumptions, which made it hard to understand intuitively. This article gives a more concise and quicker derivation.
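For concreteness, the scaled attention above can be sketched in plain Python. This is only an illustrative sketch, not code from the original post: the function names `softmax` and `entropy_invariant_attention` are made up here, matrices are represented as lists of row vectors, $n$ is taken to be the number of keys, and the default `kappa=1.0` is a placeholder rather than a recommended value.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_invariant_attention(Q, K, V, kappa=1.0):
    """Compute softmax(kappa * log(n) / d * Q K^T) V for lists of row vectors.

    kappa=1.0 is a placeholder (an assumption), not a tuned setting.
    """
    n = len(K)       # number of keys, i.e. the sequence length
    d = len(K[0])    # dimension of each query/key vector
    scale = kappa * math.log(n) / d
    out = []
    for q in Q:
        # Scaled dot products between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Compared with the usual $1/\sqrt{d}$ scaling, the only structural difference in this sketch is that the scale depends on the number of keys through `math.log(n)`.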

Derivation Process

We can set aside the context of the attention mechanism and simply assume $s_1, s_2, \dots, s_n \in \mathbb{R}$. Define:

$$p_i = \frac{e^{\lambda s_i}}{\sum_{j=1}^n e^{\lambda s_j}}$$

Clearly, this is the result of applying Softmax to $s_1, s_2, \dots, s_n$ after multiplying them by a scaling factor $\lambda$. Now we compute its entropy:

$$\begin{aligned} H &= -\sum_{i=1}^n p_i \log p_i = \log\sum_{i=1}^n e^{\lambda s_i} - \lambda\sum_{i=1}^n p_i s_i \\ &= \log n + \log\frac{1}{n}\sum_{i=1}^n e^{\lambda s_i} - \lambda\sum_{i=1}^n p_i s_i \end{aligned}$$

The first line follows from $\log p_i = \lambda s_i - \log\sum_{j=1}^n e^{\lambda s_j}$ together with $\sum_{i=1}^n p_i = 1$. The term inside the $\log$ in the second line is "exponentiate, then average." We approximate it by "average, then exponentiate" (a mean-field approximation):

$$\log\frac{1}{n}\sum_{i=1}^n e^{\lambda s_i} \approx \log\exp\left(\frac{1}{n}\sum_{i=1}^n \lambda s_i\right) = \lambda \bar{s}$$

where $\bar{s} = \frac{1}{n}\sum_{i=1}^n s_i$ is the mean of the $s_i$. Furthermore, we know that Softmax tends to concentrate on the maximum value (see "Notes on Function Smoothing: Differentiable Approximation of Non-differentiable Functions"), so we have the approximation:

$$\lambda\sum_{i=1}^n p_i s_i \approx \lambda s_{\max}$$

where $s_{\max} = \max_i s_i$. Therefore:

$$H \approx \log n - \lambda(s_{\max} - \bar{s})$$

Entropy invariance aims to eliminate the influence of the length $n$ as much as possible. In the formula above, the $\log n$ term can only be offset by $\lambda(s_{\max} - \bar{s})$, so, treating the gap $s_{\max} - \bar{s}$ as roughly independent of $n$, we need $\lambda \propto \log n$. If we apply this to the attention mechanism, $s$ takes the form $\langle \boldsymbol{q}, \boldsymbol{k} \rangle \propto d$ (where $d$ is the vector dimension), so we also need $\lambda \propto \frac{1}{d}$. Combining the two requirements gives:

$$\lambda \propto \frac{\log n}{d}$$

This is the result shown in the equation at the beginning of the article.
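The entropy formulas in this derivation can be checked numerically. The sketch below is an illustration only, not part of the original post: the helper names, the Gaussian test scores, the seed, and the chosen values of $\lambda$ are all arbitrary assumptions. It computes the entropy directly from $p_i$, recomputes it through the identity $H = \log\sum_i e^{\lambda s_i} - \lambda\sum_i p_i s_i$, and evaluates the estimate $\log n - \lambda(s_{\max} - \bar{s})$ for comparison.

```python
import math
import random

def softmax_entropy(scores, lam):
    """Exact entropy of softmax(lam * scores), computed directly from p_i."""
    m = max(lam * s for s in scores)
    exps = [math.exp(lam * s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Skip p = 0 terms (possible after underflow); their contribution is 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_via_identity(scores, lam):
    """H = log sum_i exp(lam * s_i) - lam * sum_i p_i * s_i."""
    m = max(lam * s for s in scores)
    exps = [math.exp(lam * s - m) for s in scores]
    total = sum(exps)
    log_sum_exp = m + math.log(total)
    expected_s = sum(e / total * s for e, s in zip(exps, scores))
    return log_sum_exp - lam * expected_s

def entropy_estimate(scores, lam):
    """The approximation log n - lam * (s_max - s_mean)."""
    n = len(scores)
    return math.log(n) - lam * (max(scores) - sum(scores) / n)

random.seed(0)  # arbitrary seed, only for reproducibility
scores = [random.gauss(0.0, 1.0) for _ in range(512)]  # arbitrary test data
for lam in (0.5, 1.0, 2.0):
    exact = softmax_entropy(scores, lam)
    ident = entropy_via_identity(scores, lam)
    approx = entropy_estimate(scores, lam)
    print(f"lam={lam}: exact={exact:.4f} identity={ident:.4f} estimate={approx:.4f}")
```

The first two columns should agree up to floating-point error, since the identity is exact. The estimate is cruder: for $\lambda \ge 0$, Jensen's inequality gives $\log\frac{1}{n}\sum_i e^{\lambda s_i} \ge \lambda\bar{s}$ and $\sum_i p_i s_i \le s_{\max}$, so both approximations err in the same direction and the estimate is a lower bound on $H$. It is best read as a guide to how $\lambda$ should scale with $n$, not as an accurate entropy value.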

Summary

This article gave a simple, quick derivation of the previously proposed "Entropy-Invariant Softmax."

Original Address: https://kexue.fm/archives/9034
