What Should “KL Divergence” Look Like Under GlobalPointer? · English (unofficial) translations of posts at kexue.fm

Recently, some readers mentioned that they wanted to try combining GlobalPointer with R-Drop, but did not know how to compute the KL divergence under GlobalPointer. Regularization methods such as R-Drop and Virtual Adversarial Training require the KL divergence between two probability distributions. The output of GlobalPointer, however, is not a probability distribution, so the KL divergence cannot be computed from it directly.

After some experimentation, I have come up with a usable form and verified its feasibility through simple experiments. I will introduce my analysis process here.

Symmetric Divergence

KL divergence is a function of two probability distributions. It is asymmetric, meaning KL(p\Vert q) is usually not equal to KL(q\Vert p). In practical applications, we often use the symmetrized KL divergence:

D(p,q) = KL(p\Vert q) + KL(q\Vert p)

Substituting the definition of KL divergence KL(p\Vert q)=\sum\limits_i p_i\log\frac{p_i}{q_i}, we can simplify it to obtain:

D(p,q) = \sum_i (p_i - q_i)(\log p_i - \log q_i)

Considering that p and q are usually obtained via softmax, we define:

p_i = \frac{e^{s_i}}{\sum\limits_j e^{s_j}},\quad q_i = \frac{e^{t_i}}{\sum\limits_j e^{t_j}}

Substituting these into the equation, we get:

\begin{aligned}
D(p,q) =&\, \sum_i (p_i - q_i)(s_i - t_i) + \sum_i (p_i - q_i)\left(\log\sum_j e^{t_j} - \log\sum_j e^{s_j}\right) \\
=&\, \sum_i (p_i - q_i)(s_i - t_i) + \left(\sum_i p_i - \sum_i q_i\right)\left(\log\sum_j e^{t_j} - \log\sum_j e^{s_j}\right) \\
=&\, \sum_i (p_i - q_i)(s_i - t_i)
\end{aligned}\label{eq:kl-0}
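The identity in Equation [eq:kl-0] is easy to check numerically. Below is a minimal NumPy sketch; the helper names (`softmax`, `symmetric_kl`, `symmetric_kl_from_logits`) are illustrative and do not come from the original post.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def symmetric_kl(p, q):
    # Definition: KL(p||q) + KL(q||p) for two discrete distributions.
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def symmetric_kl_from_logits(s, t):
    # Simplified form: sum_i (p_i - q_i)(s_i - t_i).
    p, q = softmax(s), softmax(t)
    return float(np.sum((p - q) * (s - t)))

rng = np.random.default_rng(0)
s = rng.normal(size=5)  # logits of the first distribution
t = rng.normal(size=5)  # logits of the second distribution

lhs = symmetric_kl(softmax(s), softmax(t))
rhs = symmetric_kl_from_logits(s, t)
print(np.isclose(lhs, rhs))
```

The two values agree up to floating-point error because the log-normalizer terms cancel once \sum_i p_i = \sum_i q_i = 1 is used.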

Analogous Results

As we can see, from the perspective of logits, the symmetric KL divergence has the following form:

D(s, t) = \sum_i (f(s_i) - f(t_i))(s_i - t_i) = \langle f(s) - f(t), s -t \rangle\label{eq:kl}

where f is the softmax operation and \langle\cdot,\cdot\rangle denotes the inner product of vectors. Formally, it is the inner product of two vectors: one is the difference of the logits, and the other is the difference of the logits after the transformation f.

What are the characteristics of the transformation f? We know that softmax is actually a smooth approximation of \text{onehot}(\text{argmax}(\cdot)) (refer to “Notes on Function Smoothing: Differentiable Approximation of Non-differentiable Functions”). For classification, the class with the maximum logit is the target class to be output, so in essence, softmax is a smooth approximation of “setting the target class to 1 and non-target classes to 0.”

With this abstract perspective, we can analogously construct the “KL divergence” for GlobalPointer. The output of GlobalPointer can also be understood as logits, but the loss function it uses is the multi-label cross-entropy proposed in “Generalizing ‘Softmax + Cross Entropy’ to Multi-label Classification Problems”. Therefore, this is essentially a question of how to calculate KL divergence under multi-label cross-entropy. The key difference is that the target classes output by GlobalPointer are not the single class with the largest logit, but all classes whose logits are greater than 0.

Therefore, for GlobalPointer, its symmetric divergence can retain the form of Equation [eq:kl], but f should be replaced with a smooth approximation of “setting values greater than 0 to 1 and values less than 0 to 0.” The sigmoid function \sigma(x)=1/(1+e^{-x}) happens to be a function that satisfies this property. Thus, we can design the symmetric KL divergence for GlobalPointer as:

D(s, t) = \sum_i (\sigma(s_i) - \sigma(t_i))(s_i - t_i) = \langle \sigma(s) - \sigma(t), s -t \rangle\label{eq:gp-kl}
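Equation [eq:gp-kl] translates almost directly into code. The following is a minimal NumPy sketch, not the author's implementation: the function name `gp_symmetric_divergence` and the optional `mask` argument (for skipping padded or otherwise invalid positions) are assumptions added for illustration.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable sigmoid via the identity sigma(x) = (1 + tanh(x/2)) / 2.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def gp_symmetric_divergence(s, t, mask=None):
    """Sum over all entries of (sigma(s) - sigma(t)) * (s - t).

    s, t : arrays of GlobalPointer logits with the same shape.
    mask : optional boolean array (hypothetical convenience argument);
           entries where mask is False contribute zero.
    """
    s = np.asarray(s, dtype=float)
    t = np.asarray(t, dtype=float)
    terms = (sigmoid(s) - sigmoid(t)) * (s - t)
    if mask is not None:
        terms = np.where(mask, terms, 0.0)
    return float(terms.sum())

s = np.array([2.0, -1.0, 0.5])
t = np.array([1.0, -3.0, 0.5])
d_st = gp_symmetric_divergence(s, t)
d_ts = gp_symmetric_divergence(t, s)
print(d_st >= 0.0, np.isclose(d_st, d_ts))
```

Because the sigmoid is monotonically increasing, every term is non-negative, and the divergence is zero exactly when the two sets of logits coincide on the unmasked positions.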

A Twist in the Tale

Interestingly, I later discovered that Equation [eq:gp-kl] is actually equivalent to applying \sigma activation to each logit separately, calculating the KL divergence of the binary probability for each, and then summing them up.

The proof is simple. Note that the binary distribution [\sigma(s), 1 - \sigma(s)] constructed by the \sigma function is the same as the binary distribution obtained by applying softmax to the logits [s, 0], i.e., [\sigma(s), 1 - \sigma(s)] = \text{softmax}([s, 0]). Therefore, according to Equation [eq:kl-0], we directly have:

\begin{aligned}
&\,D\big([\sigma(s_i),1 - \sigma(s_i)],[\sigma(t_i),1 - \sigma(t_i)]\big) \\
=&\,(\sigma(s_i)-\sigma(t_i))(s_i - t_i) + \big((1-\sigma(s_i))-(1-\sigma(t_i))\big)(0 - 0)\\
=&\,(\sigma(s_i)-\sigma(t_i))(s_i - t_i)
\end{aligned}

Summing over all components yields Equation [eq:gp-kl].
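The equivalence can also be confirmed numerically. This is a small NumPy sketch under the assumption that the logits are moderate in size (so that no probability rounds to exactly 0 or 1); the helper `binary_symmetric_kl` is an illustrative name.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable sigmoid via the identity sigma(x) = (1 + tanh(x/2)) / 2.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def binary_symmetric_kl(a, b):
    # Elementwise symmetric KL between Bernoulli(a) and Bernoulli(b),
    # written out from the definition KL(p||q) + KL(q||p).
    kl_ab = a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))
    kl_ba = b * np.log(b / a) + (1 - b) * np.log((1 - b) / (1 - a))
    return kl_ab + kl_ba

rng = np.random.default_rng(1)
s = rng.normal(size=(4, 6))  # stand-in for one set of GlobalPointer logits
t = rng.normal(size=(4, 6))  # stand-in for a second forward pass

per_logit_sum = float(binary_symmetric_kl(sigmoid(s), sigmoid(t)).sum())
closed_form = float(np.sum((sigmoid(s) - sigmoid(t)) * (s - t)))
print(np.isclose(per_logit_sum, closed_form))
```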

This equivalence shows that although treating multi-label classification as multiple binary classification problems brings about class imbalance issues, the so-called class imbalance problem does not exist when the binary view is used only to evaluate the continuity of the results (because no classification is being performed at all). Therefore, for this purpose the outputs can still be regarded as multiple binary classification problems, and the conventional KL divergence can be calculated for each of them.

Experimental Results

Some fellow researchers and I conducted simple comparative experiments. The results showed that using Equation [eq:gp-kl] as the KL divergence to apply R-Drop to GlobalPointer can indeed slightly improve performance. However, if softmax is applied directly to the logits of GlobalPointer and the conventional KL divergence is then calculated, the results are worse. This supports the soundness of Equation [eq:gp-kl].

It should be pointed out, however, that Equation [eq:gp-kl] only provides a scheme for using R-Drop or Virtual Adversarial Training with GlobalPointer. Whether performance improves in a specific case is not guaranteed, just as R-Drop does not necessarily improve performance in conventional classification problems. This requires more experimentation, especially tuning the weight coefficient of the regularization term.
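To make the role of the weight coefficient concrete, here is a hedged NumPy sketch of how an R-Drop-style objective might be assembled for a single sample. It is not the configuration used in the experiments above: the function names, the averaging of the two supervised losses, and the default `alpha=1.0` are all assumptions. The supervised term follows the multi-label cross-entropy form \log(1 + \sum_{neg} e^{s_i}) + \log(1 + \sum_{pos} e^{-s_j}) from the article cited earlier.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable sigmoid via the identity sigma(x) = (1 + tanh(x/2)) / 2.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def multilabel_cross_entropy(y_true, s):
    # log(1 + sum over negatives of e^{s}) + log(1 + sum over positives of e^{-s}).
    y = np.asarray(y_true, dtype=bool).ravel()
    s = np.asarray(s, dtype=float).ravel()
    neg_term = np.log1p(np.sum(np.exp(s[~y])))
    pos_term = np.log1p(np.sum(np.exp(-s[y])))
    return float(neg_term + pos_term)

def gp_symmetric_divergence(s, t):
    # Sigmoid-based symmetric divergence between two sets of logits.
    s = np.asarray(s, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.sum((sigmoid(s) - sigmoid(t)) * (s - t)))

def rdrop_objective(y_true, s1, s2, alpha=1.0):
    # s1, s2: logits from two dropout-perturbed forward passes of one input.
    # alpha: weight of the consistency term (assumed default; needs tuning).
    supervised = 0.5 * (multilabel_cross_entropy(y_true, s1)
                        + multilabel_cross_entropy(y_true, s2))
    return supervised + alpha * gp_symmetric_divergence(s1, s2)

y = np.array([True, False, False, True])
s1 = np.array([2.0, -1.5, -0.5, 1.0])
s2 = np.array([1.5, -2.0, 0.2, 0.8])
loss = rdrop_objective(y, s1, s2, alpha=1.0)
```

Setting `alpha=0` recovers plain supervised training, so sweeping `alpha` upward from zero is one simple way to probe whether the regularizer helps on a given task.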

Summary

This article mainly discussed the calculation of “KL divergence” under GlobalPointer, providing a usable KL divergence form for applying R-Drop or Virtual Adversarial Training to GlobalPointer.

Original Address: https://kexue.fm/archives/9039