An Unsuccessful Attempt: Generalizing Multi-label Cross-entropy to ``$n$ sets of $m$-class classification'' · English (unofficial) translations of posts at kexue.fm

Some readers may have noticed that this update has been delayed for a relatively long time. In fact, I started preparing this article last weekend. However, I underestimated the difficulty of this problem; after deriving for nearly an entire week, I still haven’t obtained a perfect result. What is being published now is still a failed attempt, and I hope experienced readers can provide some guidance.

In the article “Generalizing Softmax + Cross-Entropy to Multi-Label Classification”, we proposed a multi-label classification loss function that can automatically adjust for the imbalance between positive and negative classes. Later, in “A Soft-Label Version of Multi-Label Softmax + Cross-Entropy”, we further derived its “soft-label” version. Essentially, multi-label classification is a problem of “n sets of 2-class classification.” Correspondingly, what should the loss function for “n sets of m-class classification” look like?

This is the question explored in this article.

Analogy Attempt

In the article on soft-label generalization “A Soft-Label Version of Multi-Label Softmax + Cross-Entropy”, we obtained the final result by directly applying a first-order truncation inside the $\log$ of the sigmoid cross-entropy loss for “n sets of 2-class classification.” The same process can indeed be generalized to the softmax cross-entropy loss for “n sets of m-class classification.” This was my first attempt.

Let $\text{softmax}(s_{i,j}) = \frac{e^{s_{i,j}}}{\sum\limits_j e^{s_{i,j}}}$, where $s_{i,j}$ is the prediction and $t_{i,j}$ is the label. Then:

\begin{equation}\begin{aligned}
-\sum_i\sum_j t_{i,j}\log \text{softmax}(s_{i,j}) =& \sum_i\sum_j t_{i,j}\log \left(1 + \sum_{k\neq j} e^{s_{i,k} - s_{i,j}}\right)\\
=& \sum_j \log \prod_i\left(1 + \sum_{k\neq j} e^{s_{i,k} - s_{i,j}}\right)^{t_{i,j}}\\
=& \sum_j \log \left(1 + \sum_i t_{i,j}\sum_{k\neq j} e^{s_{i,k} - s_{i,j}}+\cdots\right)
\end{aligned}\end{equation}

The summation over $i$ defaults to $1 \sim n$, and the summation over $j$ defaults to $1 \sim m$. Truncating the higher-order terms $\cdots$, we get:

\begin{equation}l = \sum_j \log \left(1 + \sum_{i,k\neq j} t_{i,j}e^{- s_{i,j} + s_{i,k}}\right) \label{eq:loss-1}\end{equation}

This is the loss I initially obtained, which is a natural generalization of the previous results to “n sets of m-class classification.” In fact, if $t_{i,j}$ are hard labels, this loss is basically fine. However, I hoped that, like in “A Soft-Label Version of Multi-Label Softmax + Cross-Entropy”, an analytical solution could be derived for soft labels as well. To this end, I took its derivative:

\begin{equation}\frac{\partial l}{\partial s_{i,j}} = \frac{- t_{i,j}e^{- s_{i,j}}\sum\limits_{k\neq j} e^{s_{i,k}}}{1 + \sum\limits_{i,k\neq j} t_{i,j}e^{- s_{i,j} + s_{i,k}}} + \sum_{h\neq j} \frac{t_{i,h}e^{- s_{i,h}}e^{s_{i,j}}}{1 + \sum\limits_{i,k\neq h} t_{i,h}e^{- s_{i,h} + s_{i,k}}}\end{equation}

The so-called analytical solution is found by solving the equation $\frac{\partial l}{\partial s_{i,j}}=0$. However, after trying for several days, I could not find a solution to the equation. I suspect there is no simple explicit solution. Therefore, the first attempt failed.
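To make the truncated loss and its derivative concrete, here is a minimal sketch in plain Python. The function names (`softmax_ce`, `loss_1`, `grad_loss_1`) and the toy scores are illustrative assumptions, not part of the original derivation, and the exponentials are not numerically stabilized. It evaluates the truncated loss next to the ordinary softmax cross-entropy and implements the derivative formula so it can be compared against finite differences.

```python
import math

def softmax_ce(s, t):
    """Ordinary softmax cross-entropy, summed over all n groups."""
    total = 0.0
    for s_row, t_row in zip(s, t):
        mx = max(s_row)
        log_z = mx + math.log(sum(math.exp(x - mx) for x in s_row))
        total -= sum(tj * (sj - log_z) for sj, tj in zip(s_row, t_row))
    return total

def _inner(s, t, j):
    """sum_i sum_{k != j} t[i][j] * exp(s[i][k] - s[i][j]) for a fixed class j."""
    m = len(s[0])
    return sum(t[i][j] * math.exp(s[i][k] - s[i][j])
               for i in range(len(s)) for k in range(m) if k != j)

def loss_1(s, t):
    """Truncated loss: sum_j log(1 + inner_j)."""
    return sum(math.log1p(_inner(s, t, j)) for j in range(len(s[0])))

def grad_loss_1(s, t, i, j):
    """Derivative of loss_1 with respect to s[i][j], following the formula in the text."""
    m = len(s[0])
    # Contribution of the log term for class j itself.
    first = -t[i][j] * sum(math.exp(s[i][k] - s[i][j]) for k in range(m) if k != j)
    first /= 1.0 + _inner(s, t, j)
    # Contributions of the log terms for every other class h.
    second = sum(t[i][h] * math.exp(s[i][j] - s[i][h]) / (1.0 + _inner(s, t, h))
                 for h in range(m) if h != j)
    return first + second

# Toy data (assumed): two groups whose hard labels fall in the same class.
s = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.3]]
t = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(loss_1(s, t), softmax_ce(s, t))
```

With a single group and a hard label no higher-order terms exist, so the two losses coincide; with several hard-labelled groups sharing a class, the truncation drops positive cross terms and the truncated loss comes out smaller.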

Reverse Engineering from Results

After a few days without success, I turned the problem around: since the result derived by direct analogy cannot be solved, I might as well work backward from the desired result, that is, first fix the solution and then deduce what the equation should be. Thus, I began my second attempt.

First, I observed that the original multi-label loss, or the loss \eqref{eq:loss-1} obtained earlier, both have the following form:

\begin{equation}l = \sum_j \log \left(1 + \sum_i t_{i,j}e^{- f(s_{i,j})}\right) \label{eq:loss-2}\end{equation}

Using this form as a starting point, we take the derivative:

\begin{equation}\frac{\partial l}{\partial s_{i,k}} = \sum_j \frac{- t_{i,j}e^{- f(s_{i,j})}\frac{\partial f(s_{i,j})}{\partial s_{i,k}}}{1 + \sum\limits_i t_{i,j}e^{- f(s_{i,j})}}\end{equation}

We hope that $t_{i,j}=\text{softmax}(f(s_{i,j}))=e^{f(s_{i,j})}/Z_i$ is the analytical solution to $\frac{\partial l}{\partial s_{i,k}}=0$, where $Z_i=\sum\limits_j e^{f(s_{i,j})}$. Substituting this in, we get:

\begin{equation}0=\frac{\partial l}{\partial s_{i,k}} = \sum_j \frac{- (1/Z_i)\frac{\partial f(s_{i,j})}{\partial s_{i,k}}}{1 + \sum\limits_i 1/Z_i} = \frac{- (1/Z_i)\frac{\partial \left(\sum\limits_j f(s_{i,j})\right)}{\partial s_{i,k}}}{1 + \sum\limits_i 1/Z_i}\end{equation}

To make the above equation hold naturally, we find that we only need to make $\sum\limits_j f(s_{i,j})$ equal to a constant independent of $i$ and $j$. For simplicity, let:

\begin{equation}f(s_{i,j})=s_{i,j}- \bar{s}_i,\qquad \bar{s}_i=\frac{1}{m}\sum_j s_{i,j}\end{equation}

This naturally gives $\sum\limits_j f(s_{i,j})=0$. The corresponding optimization objective is:

\begin{equation}l = \sum_j \log \left(1 + \sum_i t_{i,j}e^{- s_{i,j} + \bar{s}_i}\right) \label{eq:loss-3}\end{equation}

Since $\bar{s}_i$ does not affect the normalization result, its theoretical optimal solution is $t_{i,j}=\text{softmax}(s_{i,j})$.
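The stationarity claim for the mean-centred loss can be checked numerically. The sketch below is a plain-Python illustration under assumed toy values; `loss_3`, `numerical_grad`, and `max_abs_grad` are hypothetical helper names. It estimates the gradient by central differences, once with labels equal to the softmax of the scores and once with unrelated soft labels.

```python
import math

def softmax(row):
    mx = max(row)
    exps = [math.exp(x - mx) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def loss_3(s, t):
    """sum_j log(1 + sum_i t[i][j] * exp(-s[i][j] + mean(s[i])))."""
    n, m = len(s), len(s[0])
    means = [sum(row) / m for row in s]
    total = 0.0
    for j in range(m):
        inner = sum(t[i][j] * math.exp(means[i] - s[i][j]) for i in range(n))
        total += math.log1p(inner)
    return total

def numerical_grad(f, s, eps=1e-6):
    """Central-difference estimate of the gradient of f with respect to every s[i][k]."""
    grad = [[0.0] * len(row) for row in s]
    for i in range(len(s)):
        for k in range(len(s[i])):
            old = s[i][k]
            s[i][k] = old + eps
            up = f(s)
            s[i][k] = old - eps
            down = f(s)
            s[i][k] = old  # restore the original score
            grad[i][k] = (up - down) / (2 * eps)
    return grad

# Toy scores (assumed) and two label choices.
s = [[1.0, -0.5, 0.2], [0.3, 0.3, -2.0]]
t_match = [softmax(row) for row in s]            # labels equal to softmax(s)
t_other = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]     # unrelated soft labels

def max_abs_grad(t):
    g = numerical_grad(lambda x: loss_3(x, t), s)
    return max(abs(v) for row in g for v in row)

print(max_abs_grad(t_match))  # numerically zero
print(max_abs_grad(t_other))  # clearly nonzero
```

The gradient vanishes only in the first case, which is consistent with $t_{i,j}=\text{softmax}(s_{i,j})$ being a stationary point of the loss.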

However, while it looks promising, its actual performance is quite poor. Although $t_{i,j}=\text{softmax}(s_{i,j})$ is indeed the theoretical optimal solution, in practice, the closer the labels are to hard labels, the worse the performance becomes. This is because for the loss \eqref{eq:loss-3}, as long as $s_{i,j} \gg \bar{s}_i$ holds for the target class $j$, the loss will be very close to 0. To achieve $s_{i,j} \gg \bar{s}_i$, $s_{i,j}$ does not have to be the maximum among $s_{i,1}, s_{i,2}, \dots, s_{i,m}$; it only has to exceed the mean by a wide margin, so the loss can be driven almost to zero without achieving the classification goal.
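A small hand-picked example makes this failure mode visible. The scores below are an assumption chosen for illustration: the target logit sits far above the row mean, yet another logit is larger. The helper names are hypothetical.

```python
import math

def loss_3(s, t):
    """Mean-centred loss: sum_j log(1 + sum_i t[i][j] * exp(-s[i][j] + mean(s[i])))."""
    n, m = len(s), len(s[0])
    means = [sum(row) / m for row in s]
    return sum(
        math.log1p(sum(t[i][j] * math.exp(means[i] - s[i][j]) for i in range(n)))
        for j in range(m)
    )

def softmax_ce(s, t):
    """Ordinary softmax cross-entropy, for comparison."""
    total = 0.0
    for s_row, t_row in zip(s, t):
        mx = max(s_row)
        log_z = mx + math.log(sum(math.exp(x - mx) for x in s_row))
        total -= sum(tj * (sj - log_z) for sj, tj in zip(s_row, t_row))
    return total

# One group, three classes, hard label on class 0.
# The row mean is 0, so the target logit 10 is far above the mean,
# but class 1 has an even larger logit.
s = [[10.0, 12.0, -22.0]]
t = [[1.0, 0.0, 0.0]]

predicted = max(range(len(s[0])), key=lambda j: s[0][j])
print("predicted class:", predicted)           # class 1, not the target class 0
print("mean-centred loss:", loss_3(s, t))      # on the order of 1e-5
print("softmax cross-entropy:", softmax_ce(s, t))  # larger than 2
```

The mean-centred loss is almost zero although the prediction is wrong, while the ordinary cross-entropy still penalizes it heavily.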

Thinking and Analysis

We now have two results. Equation \eqref{eq:loss-1} is an analogical generalization of the original multi-label cross-entropy; it performs well with hard labels, but since the analytical solution for soft labels cannot be found, the soft-label case cannot be theoretically evaluated. Equation \eqref{eq:loss-3} is reverse-engineered from the theoretical result; while its analytical solution is a simple softmax, due to the limitations of actual optimization algorithms, its performance on hard labels is usually poor, and it cannot even guarantee that the target logits are the maximum values. Notably, when $m=2$, both Equation \eqref{eq:loss-1} and Equation \eqref{eq:loss-3} degenerate into the standard multi-label cross-entropy.
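The $m=2$ claim can be verified by direct substitution. The sketch below assumes that the soft-label multi-label loss from the earlier article has the form $\log\left(1+\sum_i t_i e^{-s_i}\right)+\log\left(1+\sum_i (1-t_i) e^{s_i}\right)$, and identifies $t_{i,1}=t_i$, $t_{i,2}=1-t_i$. The single-score symbols $s_i$, $t_i$ and the names $l_{(1)}$, $l_{(3)}$ are introduced here only for illustration.

```latex
% m = 2: the truncated loss (first attempt)
\begin{aligned}
l_{(1)} &= \log\left(1 + \sum_i t_{i,1}\, e^{-(s_{i,1} - s_{i,2})}\right)
         + \log\left(1 + \sum_i t_{i,2}\, e^{\,s_{i,1} - s_{i,2}}\right) \\
        &= \log\left(1 + \sum_i t_i\, e^{-s_i}\right)
         + \log\left(1 + \sum_i (1 - t_i)\, e^{\,s_i}\right),
         \qquad s_i = s_{i,1} - s_{i,2}
\end{aligned}

% m = 2: the mean-centred loss (second attempt), using \bar{s}_i = (s_{i,1} + s_{i,2})/2
\begin{aligned}
l_{(3)} &= \log\left(1 + \sum_i t_{i,1}\, e^{-(s_{i,1} - s_{i,2})/2}\right)
         + \log\left(1 + \sum_i t_{i,2}\, e^{\,(s_{i,1} - s_{i,2})/2}\right) \\
        &= \log\left(1 + \sum_i t_i\, e^{-s_i}\right)
         + \log\left(1 + \sum_i (1 - t_i)\, e^{\,s_i}\right),
         \qquad s_i = \tfrac{1}{2}\left(s_{i,1} - s_{i,2}\right)
\end{aligned}
```

The two losses therefore differ only in how the single score $s_i$ is built from the pair of logits, by a factor of 2.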

We know that multi-label cross-entropy can automatically adjust for the problem of positive and negative sample imbalance. Similarly, although we haven’t yet obtained a perfect generalization, theoretically, after generalizing to “n sets of m-class classification,” it should still be able to automatically adjust for the imbalance among the $m$ classes. What is the mechanism of this balance? It is not difficult to understand. Whether it is the analogical generalization in Eq. \eqref{eq:loss-1} or the general hypothesis in Eq. \eqref{eq:loss-2}, the summation over $i$ is placed inside the $\log$. Originally, the loss contribution of each class was roughly proportional to the number of samples in that class. By moving the summation inside the $\log$, the loss contribution of each class becomes roughly equal to the logarithm of the number of samples in that class, thereby narrowing the loss gap between classes and automatically alleviating the imbalance problem.
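A back-of-the-envelope calculation illustrates the balancing effect. The sketch below makes a strong simplifying assumption that is not in the original: every sample of a class has the same margin, so a class with `count` samples contributes either `count * log(1 + e^{-margin})` with the sum outside the log, or `log(1 + count * e^{-margin})` with the sum inside. The function names and class sizes are hypothetical.

```python
import math

def contribution_sum_outside(count, margin):
    """Per-class loss when each sample gets its own log term."""
    return count * math.log1p(math.exp(-margin))

def contribution_sum_inside(count, margin):
    """Per-class loss when the sum over samples sits inside a single log."""
    return math.log1p(count * math.exp(-margin))

rare, frequent, margin = 10, 1000, 0.0  # assumed class sizes and a shared margin

ratio_outside = (contribution_sum_outside(frequent, margin)
                 / contribution_sum_outside(rare, margin))
ratio_inside = (contribution_sum_inside(frequent, margin)
                / contribution_sum_inside(rare, margin))

print("frequent/rare contribution, sum outside log:", ratio_outside)  # 100x
print("frequent/rare contribution, sum inside log:", ratio_inside)    # under 3x
```

A 100-fold difference in class size translates into a 100-fold difference in loss contribution in the first form, but into less than a 3-fold difference in the second.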

Regrettably, this article has not yet reached a perfect generalization for “n sets of m-class classification,” which should possess two characteristics: 1. Automatically adjusting for class imbalance by placing the summation inside the $\log$; 2. Having an analytical solution for the soft-label case. For hard labels, using Eq. \eqref{eq:loss-1} directly should be sufficient; for soft labels, however, I am truly at a loss. Interested readers are welcome to think this through and discuss it with me.

Summary

This article attempted to generalize the previous multi-label cross-entropy to “n sets of m-class classification.” Unfortunately, this generalization was not successful. I am sharing the results here for now, hoping that interested readers can participate in improving them.

Reprinting: Please include the original address of this article: https://kexue.fm/archives/9158
