In recent weeks, I have been reflecting on the properties of the attention mechanism. During this process, I have gained a deeper understanding of attention and Softmax. In this article, I will briefly share two of these findings:
1. Softmax attention is naturally resistant to certain noise perturbations;
2. The initialization problem can be intuitively understood from the perspective of information entropy.
Robustness
The attention mechanism based on Softmax normalization can be written as: \begin{equation} o = \frac{\sum\limits_{i=1}^n e^{s_i} v_i}{\sum\limits_{i=1}^n e^{s_i}} \end{equation} One day, a question occurred to me: what happens if we add independent and identically distributed (i.i.d.) noise to s_i? To investigate this, we consider: \begin{equation} \tilde{o} = \frac{\sum\limits_{i=1}^n e^{s_i+\varepsilon_i} v_i}{\sum\limits_{i=1}^n e^{s_i+\varepsilon_i}} \end{equation} where \varepsilon_i is i.i.d. noise. However, after a simple analysis, I found that the conclusion is "not much happens"—the attention mechanism is naturally resistant to this type of noise, i.e., \tilde{o} \approx o.
To understand this, one only needs to realize that: \begin{equation} \tilde{o} = \frac{\frac{1}{n}\sum\limits_{i=1}^n e^{s_i+\varepsilon_i} v_i}{\frac{1}{n}\sum\limits_{i=1}^n e^{s_i+\varepsilon_i}} = \frac{\mathbb{E}_i[e^{s_i+\varepsilon_i} v_i]}{\mathbb{E}_i[e^{s_i+\varepsilon_i}]} \approx \frac{\mathbb{E}_i[e^{s_i}v_i]\mathbb{E}[e^{\varepsilon}]}{\mathbb{E}_i[e^{s_i}]\mathbb{E}[e^{\varepsilon}]} = \frac{\mathbb{E}_i[e^{s_i}v_i]}{\mathbb{E}_i[e^{s_i}]} = o \end{equation} The approximation uses two facts: when n is large, the average over i behaves like an expectation, and since \varepsilon_i is independent of s_i and v_i, the expectation of the product equals the product of the expectations, so the common factor \mathbb{E}[e^{\varepsilon}] cancels between numerator and denominator.
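This is easy to check numerically. Below is a minimal NumPy sketch (the shapes, noise scale, and random seed are my own choices, not from the post) that perturbs the scores with i.i.d. Gaussian noise and compares the two attention outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 4096, 8
s = rng.normal(size=n)          # attention scores s_i
v = rng.normal(size=(n, d))     # value vectors v_i

def attend(scores, values):
    # numerically stable softmax-weighted average of the values
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

o = attend(s, v)
o_noisy = attend(s + rng.normal(scale=0.5, size=n), v)  # i.i.d. noise on s_i

# the deviation is tiny compared with the O(1) entries of v
print(np.abs(o - o_noisy).max())
```

As the derivation suggests, the agreement improves with larger n, since both numerator and denominator are averages over more terms.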
Information Content
If we denote p_i = e^{s_i} \big/ \sum\limits_{i=1}^n e^{s_i}, then p_i describes a discrete probability distribution, and we can calculate its information entropy: \begin{equation} H = -\sum_{i=1}^n p_i \log p_i \quad \in [0, \log n] \end{equation} In "Entropy" is Hard to Afford: From Entropy and the Maximum Entropy Principle to Maximum Entropy Models (Part 1), we discussed that entropy is both a measure of uncertainty and a measure of information content. How are the two connected? Entropy is essentially a measure of uniformity: the more uniform a distribution, the more uncertain it is, which is why entropy measures uncertainty. And since the lower bound of entropy is 0, the initial entropy also equals the maximum amount of information we can obtain in going from "uncertainty" to "complete certainty".
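As a quick numerical illustration (a NumPy sketch of my own, not from the original post), the entropy of the softmax distribution attains its upper bound \log n for uniform scores and collapses toward 0 for sharply peaked ones:

```python
import numpy as np

def softmax_entropy(s):
    # H = -sum_i p_i log p_i for p_i = e^{s_i} / sum_j e^{s_j}
    p = np.exp(s - s.max())
    p /= p.sum()
    return -np.sum(p * np.log(p))

n = 16
print(softmax_entropy(np.zeros(n)))          # uniform scores: H = log n
print(softmax_entropy(10.0 * np.arange(n)))  # near one-hot: H close to 0
```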
We know that if s_i is initialized to be very large, p_i will approach a one-hot distribution. At this point, training becomes impossible due to vanishing gradients (refer to A Brief Discussion on the Initialization, Parameterization, and Normalization of Transformers). I found that this can be understood very intuitively from the perspective of information content: model training is itself a process of moving from uncertainty (a random model) to certainty (a trained model). The optimizer is responsible for "extracting" information from the random model. However, the information content of a one-hot distribution is 0; the optimizer has "nothing to gain" and might even have to "pay out," so naturally, it cannot be optimized well. Therefore, we should initialize the model to be as uniform as possible to ensure that the "extractable" information content is maximized.
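The vanishing gradient can be seen directly from the softmax Jacobian, \partial p_i / \partial s_j = p_i(\delta_{ij} - p_j): every entry contains a factor that vanishes as p approaches one-hot. A small NumPy sketch (the scores here are hypothetical):

```python
import numpy as np

def softmax(s):
    p = np.exp(s - s.max())
    return p / p.sum()

def softmax_jacobian(s):
    # dp_i/ds_j = p_i * (delta_ij - p_j)
    p = softmax(s)
    return np.diag(p) - np.outer(p, p)

s = np.array([1.0, 0.5, 0.0, -0.5])
for scale in [1.0, 10.0, 100.0]:
    # larger scores -> p closer to one-hot -> every Jacobian entry shrinks
    print(scale, np.abs(softmax_jacobian(scale * s)).max())
```

At large scale the largest Jacobian entry is on the order of p_{\max}(1 - p_{\max}), which decays exponentially in the score gap, so essentially no gradient flows back through the softmax.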
Of course, besides making the upper bound of the information content large enough, we must also make its lower bound small enough, so that the "extractable" information is as large as possible. Previously, when introducing contrastive learning, some readers did not understand the significance of the temperature parameter \tau; this, too, can be understood through information content. Let: \begin{equation} p_i = \frac{e^{(\cos\theta_i) / \tau}}{\sum\limits_{i=1}^n e^{(\cos\theta_i)/\tau}} \end{equation} If \tau=1, the upper bound of the information entropy is \log n, but the lower bound is approximately \log n - 0.4745 (refer to the comment section), so at most about 0.4745 nats of information can ever be extracted, which is too little. Therefore, we decrease \tau so that the lower bound of the information entropy approaches 0, thereby increasing the amount of information that can be obtained.
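Both bounds can be checked numerically. In the sketch below (NumPy assumed; the worst-case fraction of roughly 0.2055 is my own calculation for the two-valued cosine configuration, and it reproduces the 0.4745 figure), at \tau=1 even the most extreme cosine pattern leaves the entropy within about 0.4745 of \log n, while at small \tau a single dominant cosine drives the entropy toward 0:

```python
import numpy as np

def softmax_entropy(s):
    p = np.exp(s - s.max())
    p /= p.sum()
    return -np.sum(p * np.log(p))

n = 10000

# tau = 1: worst two-valued pattern, a fraction ~0.2055 of the cosines
# at +1 and the rest at -1; the entropy deficit log n - H is ~0.4745
k = round(0.2055 * n)
cos_worst = np.concatenate([np.ones(k), -np.ones(n - k)])
deficit = np.log(n) - softmax_entropy(cos_worst / 1.0)
print(deficit)    # close to 0.4745: little information left to extract

# small tau: one cosine at +1, the rest at -1; the entropy collapses
cos_peak = np.concatenate([np.ones(1), -np.ones(n - 1)])
h_small = softmax_entropy(cos_peak / 0.05)
print(h_small)    # close to 0
```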
In Short
This post has briefly recorded these two small observations about attention. As you can see, the final conclusion is still: I Heard Attention and Softmax Go Better Together...
When reprinting, please include the original address of this article: https://kexue.fm/archives/9593