English (unofficial) translations of posts at kexue.fm

Soft-label Version of Multi-label “Softmax + Cross Entropy”

Translated by DeepSeek V4 Pro. Translations can be inaccurate, please refer to the original post for important stuff.

(Note: The relevant content of this article has been organized into the paper “ZLPR: A Novel Loss for Multi-label Classification”. If you need to cite it, you can directly cite the English paper. Thank you.)

In the article “Generalizing Softmax + Cross Entropy to Multi-label Classification”, we proposed a loss function for multi-label classification: \log \left(1 + \sum_{i\in\Omega_{neg}} e^{s_i}\right) + \log \left(1 + \sum_{j\in\Omega_{pos}} e^{-s_j}\right) \label{eq:original} This loss function possesses the advantages of “Softmax + Cross Entropy” used in single-label classification, working effectively even when there is a significant imbalance between positive and negative classes. However, as seen from the form of this loss function, it is only applicable to “hard labels,” which means techniques like label smoothing and mixup cannot be used. This article attempts to solve this problem by proposing a soft-label version of the aforementioned loss function.
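As a point of reference, the hard-label loss in Eq. [eq:original] can be sketched in a few lines of plain Python. This is only a minimal illustration, not the author's implementation: the function names `logsumexp` and `multilabel_loss` are hypothetical, and labels are assumed to be exactly 0 or 1. The "1 +" inside each logarithm is obtained by adding a zero entry before taking the log-sum-exp.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def multilabel_loss(scores, labels):
    """log(1 + sum_{neg} e^{s_i}) + log(1 + sum_{pos} e^{-s_j}) for 0/1 labels."""
    # The leading 0.0 contributes e^0 = 1, i.e. the "1 +" term in each log.
    neg_terms = [0.0] + [s for s, y in zip(scores, labels) if y == 0]
    pos_terms = [0.0] + [-s for s, y in zip(scores, labels) if y == 1]
    return logsumexp(neg_terms) + logsumexp(pos_terms)

# Scores that separate the classes correctly should give the smaller loss.
good = multilabel_loss([5.0, -5.0], [1, 0])
bad = multilabel_loss([-5.0, 5.0], [1, 0])
print(good < bad)
```

Note that the label only selects which sum a score enters, which is why this form has no obvious place for a fractional label.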

Ingenious Connection

The classic approach to multi-label classification is to transform it into multiple binary classification problems, where each category is activated by the sigmoid function \sigma(x)=1/(1+e^{-x}), and then each uses the binary cross entropy (BCE) loss. When the positive and negative categories are extremely imbalanced, the performance of this approach is usually poor, whereas the loss in Eq. [eq:original] is typically a superior choice.

In the comments section of a previous article, reader @wu.yan revealed an ingenious connection between multiple “sigmoid + BCE” losses and Eq. [eq:original]. Multiple “sigmoid + BCE” losses can be appropriately rewritten as: \begin{aligned} &\,-\sum_{j\in\Omega_{pos}}\log\sigma(s_j)-\sum_{i\in\Omega_{neg}}\log(1-\sigma(s_i))\\ =&\, \log\prod_{j\in\Omega_{pos}}(1+e^{-s_j})+\log\prod_{i\in\Omega_{neg}}(1+e^{s_i})\\ =&\, \log\left(1+\sum_{j\in\Omega_{pos}}e^{-s_j}+\dots\right)+\log\left(1+\sum_{i\in\Omega_{neg}}e^{s_i}+\dots\right) \end{aligned} \label{eq:link} Comparing this with Eq. [eq:original], we find that Eq. [eq:original] is exactly the loss of multiple “sigmoid + BCE” with the higher-order terms represented by \dots removed! When the positive and negative classes are imbalanced, these higher-order terms occupy too much weight, exacerbating the imbalance problem and leading to poor results. Conversely, removing these higher-order terms does not change the purpose of the loss function (hoping positive class scores are greater than 0 and negative class scores are less than 0). Furthermore, because the summation within the parentheses is linearly related to the number of categories, the loss gap between positive and negative classes is not too large.
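The relationship in Eq. [eq:link] can be checked numerically. The sketch below is an illustration under assumed values (one positive class and one hundred negative classes, with hypothetical function names `bce_sum` and `truncated_loss`); it compares the sum of “sigmoid + BCE” losses with the truncated form of Eq. [eq:original]. Since the dropped higher-order terms are all positive, the BCE sum can never be smaller than the truncated loss.

```python
import math

def bce_sum(scores, labels):
    """Sum of per-class sigmoid + BCE losses for 0/1 labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        # -log sigma(s) = log(1 + e^{-s});  -log(1 - sigma(s)) = log(1 + e^{s})
        z = -s if y == 1 else s
        total += math.log1p(math.exp(z))
    return total

def truncated_loss(scores, labels):
    """Eq. [eq:original]: keep only the first-order terms inside each log."""
    pos = sum(math.exp(-s) for s, y in zip(scores, labels) if y == 1)
    neg = sum(math.exp(s) for s, y in zip(scores, labels) if y == 0)
    return math.log1p(pos) + math.log1p(neg)

# Assumed imbalanced example: 1 positive, 100 negatives, all scored correctly.
scores = [2.0] + [-2.0] * 100
labels = [1] + [0] * 100
print(bce_sum(scores, labels), truncated_loss(scores, labels))
```

In the BCE sum, the negative classes contribute one logarithm each, so their total grows linearly with their count. In the truncated loss the count appears inside a single logarithm, so the negative side grows only logarithmically.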

Formulation Guessing

This ingenious connection tells us that to find a soft-label version of Eq. [eq:original], we can try starting from the soft-label version of multiple “sigmoid + BCE” and then attempt to remove the higher-order terms. So-called soft labels mean that labels are no longer just 0 or 1, but can be any real number between 0 and 1, representing the probability of belonging to that class. For binary cross entropy, the soft-label version is simple: -t\log\sigma(s)-(1-t)\log(1-\sigma(s)) where t is the soft label and s is the corresponding score. Mimicking the process in Eq. [eq:link], we get: \begin{aligned} &\,-\sum_i t_i\log\sigma(s_i)-\sum_i (1-t_i)\log(1-\sigma(s_i))\\ =&\, \log\prod_i(1+e^{-s_i})^{t_i}+\log\prod_i (1+e^{s_i})^{1-t_i}\\ =&\, \log\prod_i(1+t_i e^{-s_i} + \dots)+\log\prod_i (1+(1-t_i)e^{s_i}+\dots)\\ =&\, \log\left(1+\sum_i t_i e^{-s_i}+\dots\right)+\log\left(1+\sum_i(1-t_i)e^{s_i}+\dots\right) \end{aligned} If we remove the higher-order terms, we obtain: \log\left(1+\sum_i t_i e^{-s_i}\right)+\log\left(1+\sum_i(1-t_i)e^{s_i}\right) \label{eq:soft} This is the candidate form for the soft-label version of Eq. [eq:original]. It can be seen that when t_i\in\{0,1\}, it degenerates exactly into Eq. [eq:original].
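The claim that Eq. [eq:soft] degenerates into Eq. [eq:original] for hard labels is easy to verify numerically. The following is a naive sketch (function names are hypothetical, and no care is taken about overflow for large scores) that evaluates both forms directly from their definitions.

```python
import math

def soft_multilabel_loss(scores, targets):
    """Eq. [eq:soft]: log(1 + sum t_i e^{-s_i}) + log(1 + sum (1-t_i) e^{s_i})."""
    pos = sum(t * math.exp(-s) for s, t in zip(scores, targets))
    neg = sum((1.0 - t) * math.exp(s) for s, t in zip(scores, targets))
    return math.log1p(pos) + math.log1p(neg)

def hard_multilabel_loss(scores, labels):
    """Eq. [eq:original], for labels that are exactly 0 or 1."""
    pos = sum(math.exp(-s) for s, y in zip(scores, labels) if y == 1)
    neg = sum(math.exp(s) for s, y in zip(scores, labels) if y == 0)
    return math.log1p(pos) + math.log1p(neg)

scores = [1.0, -2.0, 0.5]
hard_labels = [1, 0, 1]
# With 0/1 targets the weights t_i and 1 - t_i act as class selectors.
print(soft_multilabel_loss(scores, hard_labels), hard_multilabel_loss(scores, hard_labels))
# Fractional targets (e.g. from label smoothing) are now also accepted.
print(soft_multilabel_loss(scores, [0.9, 0.1, 0.9]))
```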

Proof of Results

For now, Eq. [eq:soft] is at most a “candidate” form. To validate it, we need to prove that when t_i is any real number between 0 and 1, Eq. [eq:soft] can still learn meaningful results. By “meaningful,” we mean that it is theoretically possible to reconstruct t_i from s_i (s_i is the model prediction and t_i is the given label, so having s_i recover t_i is precisely the goal of machine learning).

To this end, we denote Eq. [eq:soft] as l and calculate the partial derivative with respect to s_i: \frac{\partial l}{\partial s_i} = \frac{-t_i e^{-s_i}}{1+\sum\limits_j t_j e^{-s_j}}+\frac{(1-t_i)e^{s_i}}{1+\sum\limits_j(1-t_j)e^{s_j}} We know that the minimum of l occurs when all \frac{\partial l}{\partial s_i} are equal to 0. Solving the system of equations \frac{\partial l}{\partial s_i}=0 directly is not easy, but I noticed a magical “coincidence”: when t_i e^{-s_i}=(1-t_i)e^{s_i} holds for every i, the two numerators are equal in magnitude and the two denominators are equal as well (they are sums of matching terms), so each \frac{\partial l}{\partial s_i} automatically equals 0! Therefore, t_i e^{-s_i}=(1-t_i)e^{s_i} should be the optimal solution for l. Solving this gives: t_i = \frac{1}{1+e^{-2s_i}}=\sigma(2s_i) This is a very beautiful result, which tells us several things:

1. Eq. [eq:soft] is indeed a reasonable soft-label generalization of Eq. [eq:original]. It can completely reconstruct the information of t_i through s_i, and its form is also related to the sigmoid function.

2. If we want to output the results as probability values between 0 and 1, the correct approach should be \sigma(2s_i) rather than the intuitive \sigma(s_i).

3. Since the final probability formula also has a sigmoid form, looking at it from another perspective, it can be understood that we are still learning multiple binary classification problems with sigmoid activation, but the loss function has been replaced by Eq. [eq:soft].
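The stationarity argument above can be checked numerically. The sketch below is a minimal illustration with assumed target values and hypothetical function names; it evaluates the analytic gradient of Eq. [eq:soft] at s_i = \frac{1}{2}\log\frac{t_i}{1-t_i}, which is the solution of t_i = \sigma(2s_i), and confirms that \sigma(2s_i) recovers the targets.

```python
import math

def soft_loss_grad(scores, targets):
    """Analytic gradient of Eq. [eq:soft] with respect to each score."""
    a = [t * math.exp(-s) for s, t in zip(scores, targets)]          # t_i e^{-s_i}
    b = [(1.0 - t) * math.exp(s) for s, t in zip(scores, targets)]   # (1-t_i) e^{s_i}
    za, zb = 1.0 + sum(a), 1.0 + sum(b)
    return [-ai / za + bi / zb for ai, bi in zip(a, b)]

def optimal_scores(targets):
    """Solve t = sigma(2s) for s; requires 0 < t < 1."""
    return [0.5 * math.log(t / (1.0 - t)) for t in targets]

def predicted_probs(scores):
    """Probability read-out sigma(2s), as suggested by the derivation."""
    return [1.0 / (1.0 + math.exp(-2.0 * s)) for s in scores]

targets = [0.9, 0.1, 0.5, 0.3]      # assumed soft labels for illustration
s_star = optimal_scores(targets)
grad = soft_loss_grad(s_star, targets)
recovered = predicted_probs(s_star)
print(max(abs(g) for g in grad))    # should be zero up to rounding error
print(recovered)
```

Because each term of Eq. [eq:soft] is a log-sum-exp of linear functions of the scores, the loss is convex in s, so a point where the gradient vanishes is a global minimum rather than a saddle point.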

Implementation Tips

For an implementation of Eq. [eq:soft], see the multilabel_categorical_crossentropy code in bert4keras. One small detail is worth discussing.

First, we can equivalently rewrite Eq. [eq:soft] as: \log\left(1+\sum_i e^{-s_i + \log t_i}\right)+\log\left(1+\sum_i e^{s_i + \log (1-t_i)}\right) \label{eq:soft-log} So it seems we only need to add \log t_i to -s_i, add \log(1-t_i) to s_i, and then perform a standard logsumexp after padding with zero. However, in practice, t_i can take values of 0 or 1, making \log t_i or \log(1-t_i) negative infinity. Since frameworks cannot handle negative infinity directly, we usually need to clip before the \log, i.e., given \epsilon > 0: \text{clip}(t)=\begin{cases} \epsilon, & t < \epsilon \\ t, & \epsilon \leq t \leq 1-\epsilon \\ 1-\epsilon, & t > 1-\epsilon \end{cases}

But this clipping introduces a problem. Since \epsilon is not truly infinitesimal (e.g., \epsilon=10^{-7}), \log\epsilon is only about -16. In scenarios like GlobalPointer, we pre-mask invalid s_i by setting them to a very large negative number, such as -10^7. Looking at Eq. [eq:soft-log], the summation in the first term involves e^{-s_i + \log t_i}, so -10^7 becomes 10^7. If t_i were not clipped, \log t_i would be \log 0 = -\infty, which would pull -s_i + \log t_i back to negative infinity. However, as we just saw, the clipped \log t_i is no lower than about -16, whose magnitude is far smaller than the 10^7 contributed by -s_i. Thus, -s_i + \log t_i remains a large positive number, and the masked position dominates the sum instead of vanishing from it.

To solve this, we must not only clip t_i, but also identify t_i values that were originally less than \epsilon and manually set the corresponding -s_i to a very large negative number. Similarly, for t_i greater than 1-\epsilon, we set the corresponding s_i to a very large negative number. This treats values less than \epsilon as exactly 0 and values greater than 1-\epsilon as exactly 1.
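The clipping-plus-override procedure can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the bert4keras code: the names `soft_multilabel_loss_stable` and `NEG_BIG` are hypothetical, and the values -10^12 for the “very large negative number” and \epsilon=10^{-7} are illustrative choices.

```python
import math

NEG_BIG = -1e12  # assumed stand-in for "a very large negative number"

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def soft_multilabel_loss_stable(scores, targets, eps=1e-7):
    """Eq. [eq:soft-log] with clipped targets and overrides at t ~ 0 and t ~ 1."""
    pos_terms = [0.0]  # the 0.0 entry supplies the "1 +" inside the first log
    neg_terms = [0.0]  # likewise for the second log
    for s, t in zip(scores, targets):
        t_clipped = min(max(t, eps), 1.0 - eps)
        # Treat t < eps as exactly 0: force -s_i to a large negative number.
        minus_s = NEG_BIG if t < eps else -s
        # Treat t > 1 - eps as exactly 1: force s_i to a large negative number.
        plus_s = NEG_BIG if t > 1.0 - eps else s
        pos_terms.append(minus_s + math.log(t_clipped))
        neg_terms.append(plus_s + math.log(1.0 - t_clipped))
    return logsumexp(pos_terms) + logsumexp(neg_terms)

# A pre-masked position (score -1e7, target 0) should not affect the loss.
with_mask = soft_multilabel_loss_stable([2.0, -1.0, -1e7], [1.0, 0.0, 0.0])
without_mask = soft_multilabel_loss_stable([2.0, -1.0], [1.0, 0.0])
print(with_mask, without_mask)
```

Without the two overrides, the masked position would contribute roughly 10^7 - 16 to the first log-sum-exp and swamp every valid entry. With them, its contribution underflows to zero in both terms.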

Summary

This article primarily generalizes the previously proposed multi-label “Softmax + Cross Entropy” to soft-label scenarios. With the corresponding soft-label version, we can combine it with techniques like label smoothing and mixup. For models like GlobalPointer, this provides another direction for optimization.

Original Address: https://kexue.fm/archives/9064
