English (unofficial) translations of posts at kexue.fm

A Simple Solution to Mitigate Overconfidence in Cross-Entropy

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

It is well known that the standard evaluation metric for classification problems is accuracy, while the standard loss function is cross-entropy. Cross-entropy has the advantage of fast convergence, but it is not a smooth approximation of accuracy, which leads to an inconsistency between training and prediction. On the other hand, when the predicted probability of a training sample is very low, cross-entropy yields a massive loss (tending towards -\log 0^{+} = \infty). This means that cross-entropy pays excessive attention to samples with low predicted probabilities—even if those samples might be “noisy data.” Consequently, models trained with cross-entropy often exhibit overconfidence, where the model assigns high predicted probabilities to every sample. This brings two side effects: first, a decrease in performance due to overfitting on noisy data; second, the predicted probability values cannot serve as a reliable indicator of uncertainty.

Regarding improvements to cross-entropy, the academic community has produced a steady stream of research. This field remains in a state where “diverse approaches compete,” with no single standard answer. In this article, we explore another simple candidate solution to this problem, provided by the paper “Tailoring Language Generation Models under Total Variation Distance.”

Introduction to the Results

As the name suggests, the modifications in the original paper are targeted at text generation tasks, with a theoretical foundation based on Total Variation distance (refer to “Designing GANs: Another GAN Production Workshop”). However, after a series of relaxations and simplifications in the original paper, the final result no longer has a significant connection to Total Variation distance and is theoretically not limited to text generation tasks. Therefore, this article treats it as a loss function for general classification tasks.

For a data pair (x, y), the loss function given by cross-entropy is: \begin{equation} -\log p_{\theta}(y|x) \end{equation} The modification in the original paper is very simple, changing it to: \begin{equation} -\frac{\log \big[\gamma + (1 - \gamma)p_{\theta}(y|x)\big]}{1-\gamma} \label{eq:gamma-ce} \end{equation} where \gamma \in [0, 1]. When \gamma=0, it is ordinary cross-entropy; when \gamma \to 1, the limit is 1 - p_{\theta}(y|x), which as a loss is equivalent to -p_{\theta}(y|x) (the two differ only by an additive constant).
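As a concrete illustration, here is a minimal sketch of the modified loss in plain Python (the function name is ours, not from the paper):

```python
import math

def gamma_cross_entropy(p, gamma):
    """Modified loss: -log[gamma + (1 - gamma) * p] / (1 - gamma).

    p     : predicted probability of the target class, in (0, 1].
    gamma : interpolation parameter in [0, 1).
    At gamma = 0 this reduces to ordinary cross-entropy -log(p);
    as gamma -> 1 it tends to 1 - p.
    """
    return -math.log(gamma + (1.0 - gamma) * p) / (1.0 - gamma)

# Low-probability (possibly noisy) samples are penalized far less:
# gamma_cross_entropy(1e-6, 0.0) -> ~13.8  (ordinary CE blows up)
# gamma_cross_entropy(1e-6, 0.1) -> ~2.56  (bounded near -log(gamma)/(1-gamma))
```

Note that for \gamma > 0 the loss is bounded above by -\log\gamma/(1-\gamma) even when p \to 0, which is exactly what tames the excessive attention cross-entropy pays to low-probability samples.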

In the experiments of the original paper, the selection of \gamma varied significantly across different tasks. For example, \gamma=10^{-7} was used for language modeling tasks, \gamma=0.1 for machine translation, and \gamma=0.8 for text summarization. A rule of thumb is that if training from scratch, a \gamma closer to 0 should be chosen; if fine-tuning, a relatively larger \gamma can be considered. Additionally, there is a more intuitive approach: treating \gamma as a dynamic parameter, starting from \gamma=0 and gradually shifting towards \gamma=1 as training progresses, though this adds another schedule to tune.

In terms of effectiveness, since there is an adjustable \gamma parameter and the original cross-entropy is included as a special case, as long as one puts effort into tuning, there is generally a good chance of achieving better results than cross-entropy. This is not a major concern.

Personal Derivation

How should we understand Equation [eq:gamma-ce]? In the section on “Accuracy” in “Notes on Function Smoothing: Differentiable Approximations of Non-differentiable Functions,” we derived that a smooth approximation of accuracy is: \begin{equation} \mathbb{E}_{(x,y)\sim \mathcal{D}}[p_{\theta}(y|x)] \end{equation} Therefore, if our evaluation metric is accuracy, it intuitively seems that -p_{\theta}(y|x) should be the loss function, since then changes in the loss track changes in accuracy more closely. In practice, however, cross-entropy often performs better. Yet cross-entropy's real advantage is merely that it is “easier to train,” so it sometimes results in “over-training” and overfitting. Thus, a natural idea is to “interpolate” between the two losses to balance their respective advantages.

To this end, let us consider the gradients of both (where accuracy refers to its negative smooth approximation -p_{\theta}(y|x)): \begin{equation} \begin{aligned} \text{Accuracy: } & \quad -\nabla_{\theta} p_{\theta}(y|x) \\ \text{Cross-Entropy: } & \quad -\frac{1}{p_{\theta}(y|x)}\nabla_{\theta} p_{\theta}(y|x) \end{aligned} \end{equation} The two differ only by the factor \frac{1}{p_{\theta}(y|x)}. How can we smoothly interpolate between this factor and the constant 1? The solution in the original paper is: \begin{equation} \frac{1}{\gamma + (1 - \gamma)p_{\theta}(y|x)} \end{equation} Of course, this construction is not unique. The choice in the original paper preserves the gradient characteristics of cross-entropy as much as possible, thereby retaining its fast convergence. Based on this construction, we want the gradient of the new loss function to be: \begin{equation} -\frac{\nabla_{\theta}p_{\theta}(y|x)}{\gamma + (1 - \gamma)p_{\theta}(y|x)} = \nabla_{\theta}\left(-\frac{\log \big[\gamma + (1 - \gamma)p_{\theta}(y|x)\big]}{1-\gamma}\right) \label{eq:gamma-ce-g} \end{equation} which integrates back to exactly the loss function in Equation [eq:gamma-ce]. In other words, we first designed the new gradient and then recovered the corresponding loss function by finding its primitive.
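The gradient identity above can be checked numerically. The sketch below verifies, via central finite differences, that the derivative of the proposed loss with respect to p is indeed -1/(\gamma + (1-\gamma)p) (function names are ours):

```python
import math

def gamma_ce(p, gamma):
    # Proposed loss: -log[gamma + (1 - gamma) * p] / (1 - gamma)
    return -math.log(gamma + (1.0 - gamma) * p) / (1.0 - gamma)

def gamma_ce_grad(p, gamma):
    # Gradient factor from the derivation: -1 / (gamma + (1 - gamma) * p)
    return -1.0 / (gamma + (1.0 - gamma) * p)

# Central finite difference agrees with the analytic gradient.
p, gamma, eps = 0.3, 0.2, 1e-6
numeric = (gamma_ce(p + eps, gamma) - gamma_ce(p - eps, gamma)) / (2 * eps)
assert abs(numeric - gamma_ce_grad(p, gamma)) < 1e-5
```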

Further Discussion

Why design the loss function from the perspective of the gradient? There are roughly two reasons.

First, many loss functions simplify significantly after taking the gradient. Therefore, designing in the gradient space often provides more inspiration and freedom. For example, in this case, designing the transition function \frac{1}{\gamma + (1 - \gamma)p_{\theta}(y|x)} between \frac{1}{p_{\theta}(y|x)} and 1 in the gradient space is not too complex. However, directly designing a transition function between p_{\theta}(y|x) and \log p_{\theta}(y|x) in the loss function space, such as \frac{\log [\gamma + (1 - \gamma)p_{\theta}(y|x)]}{1-\gamma}, would be much more complicated.

Second, the optimizers currently in use are all gradient-based, so in many cases, we only need to design the gradient; it is not even necessary to find the primitive function. The original result of the paper actually only provided the gradient: \begin{equation} -\max\left(b, \frac{p_{\theta}(y|x)}{\gamma + (1 - \gamma)p_{\theta}(y|x)}\right)\nabla_{\theta}\log p_{\theta}(y|x) \end{equation} When b=0, it is equivalent to Equation [eq:gamma-ce]. That is to say, the original paper also added a threshold when designing the gradient, at which point it becomes difficult to write a simple primitive function. However, implementing the above is not difficult; one only needs to consider the loss function: \begin{equation} -\max\left(b, \frac{p_{\theta}(y|x)}{\gamma + (1 - \gamma)p_{\theta}(y|x)}\right)_{\text{stop\_grad}}\log p_{\theta}(y|x) \end{equation} Here, \text{stop\_grad} means directly cutting off the gradient of this part of the result, which corresponds to the tf.stop_gradient operator in TensorFlow.
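Since the \max(\cdot) factor is wrapped in stop-gradient, backpropagation treats it as a constant, so the per-sample gradient with respect to p can be written down directly. The helper below is a hypothetical, framework-free illustration of what that construction computes:

```python
def truncated_loss_grad_p(p, gamma, b=0.0):
    """Per-sample gradient w.r.t. p of the truncated objective
        -max(b, p / (gamma + (1 - gamma) * p))_stop_grad * log(p).
    The stop-gradient freezes the max(...) factor during backprop,
    so the gradient is just that factor times d(-log p)/dp = -1/p.
    (Hypothetical helper, for illustration only.)"""
    coef = max(b, p / (gamma + (1.0 - gamma) * p))
    return -coef / p

# With b = 0 the threshold is inactive and this reduces to the
# gradient of the gamma-modified loss: -1 / (gamma + (1 - gamma) * p).
```

In an actual implementation one would wrap the coefficient with tf.stop_gradient in TensorFlow (or .detach() in PyTorch) and let autodiff handle the rest.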

Summary

This article introduced a simple scheme to mitigate the overconfidence of cross-entropy: interpolating, at the gradient level, between cross-entropy and a smooth approximation of accuracy via a single parameter \gamma.